linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] support large folio for mlock
@ 2023-07-28  7:09 Yin Fengwei
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-07-28  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, david, ryan.roberts,
	shy828301, hughd
  Cc: fengwei.yin

Yu mentioned at [1] about the mlock() can't be applied to large folio.

I leant the related code and here is my understanding:
- For RLIMIT_MEMLOCK related, there is no problem. Becuase the
  RLIMIT_MEMLOCK statistics is not related underneath page. That means
  underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK
  statistics collection which is always correct.

- For keeping the page in RAM, there is no problem either. At least,
  during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit
  set in vm_flags, the folio will be kept whatever the folio is
  mlocked or not.

So the function of mlock for large folio works. But it's not optimized
because the page reclaim needs scan these large folio and may split
them.

This series identified the large folio for mlock to two types:
  - The large folio is in VM_LOCKED VMA range

  - The large folio cross VM_LOCKED VMA boundary

For the first type, we mlock large folio so page relcaim will skip it.
For the second type, we don't mlock large folio. It's allowed to be
picked by page reclaim and be split. So the pages not in VM_LOCKED VMA
range are allowed to be reclaimed/released.

patch1 introduce API to check whether large folio is in VMA range.
patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support
       large folio mlock/munlock.
patch3 make mlock/munlock syscall support large folio.

testing done:
  - kernel selftest. No extra failure introduced

RFC v2 was post here [2].

Yu also mentioned a race which can make folio unevictable after munlock
during RFC v2 discussion [3]:
We decided that race issue didn't block this series based on:
  - That race issue was not introduced by this sereis

  - We had a looks-ok fix for that race issue. Need to wait
    for mlock_count fixing patch as Yosry Ahmed suggested [4]

ChangeLog from RFC v2:
  - Removed RFC

  - dropped folio_is_large() check as suggested by both Yu and Huge

  - Besides the address/pgoff check, also check the page table
    entry when check whether the folio is in the range. This is
    to handle mremap case that address/pgoff is in range, but
    folio can't be identified as in range.

  - Fixed one issue in page_add_anon_rmap() and page_add_anon_rmap()
    introdued by RFC v2. As these two functions can be called multiple
    times against one folio. And remove_rmap() may not be called same
    times. Which can bring imbalanced mlock_count. Fix it by skip
    mlock large folio in these two functions.

[1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
[3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/CAJD7tkZJFG=7xs=9otc5CKs6odWu48daUuZP9Wd9Z-sZF07hXg@mail.gmail.com/

Yin Fengwei (3):
  mm: add functions folio_in_range() and folio_within_vma()
  mm: handle large folio when large folio in VM_LOCKED VMA range
  mm: mlock: update mlock_pte_range to handle large folio

 mm/internal.h | 87 +++++++++++++++++++++++++++++++++++++++++++++------
 mm/mlock.c    | 57 +++++++++++++++++++++++++++++++--
 mm/rmap.c     | 27 +++++++++++-----
 3 files changed, 153 insertions(+), 18 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-07-28  7:09 [PATCH 0/3] support large folio for mlock Yin Fengwei
@ 2023-07-28  7:09 ` Yin Fengwei
  2023-07-28 18:34   ` Andrew Morton
                     ` (3 more replies)
  2023-07-28  7:09 ` [PATCH 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range Yin Fengwei
  2023-07-28  7:09 ` [PATCH 3/3] mm: mlock: update mlock_pte_range to handle large folio Yin Fengwei
  2 siblings, 4 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-07-28  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, david, ryan.roberts,
	shy828301, hughd
  Cc: fengwei.yin

It will be used to check whether the folio is mapped to specific
VMA and whether the mapping address of folio is in the range.

Also a helper function folio_within_vma() to check whether folio
is in the range of vma based on folio_in_range().

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index 5a03bc4782a2..63de32154a48 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
 				   bool write, int *locked);
 extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
 			       unsigned long bytes);
+
+/*
+ * Check whether the folio is in specific range
+ *
+ * First, check whether the folio is in the range of vma.
+ * Then, check whether the folio is mapped to the range of [start, end].
+ * In the end, check whether the folio is fully mapped to the range.
+ *
+ * @pte page table pointer will be checked whether the large folio
+ *      is fully mapped to. Currently, if mremap in the middle of
+ *      large folio, the large folio could be mapped to to different
+ *      VMA and address check can't identify this situation.
+ */
+static inline bool
+folio_in_range(struct folio *folio, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, pte_t *pte)
+{
+	pte_t ptent;
+	unsigned long i, nr = folio_nr_pages(folio);
+	pgoff_t pgoff, addr;
+	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
+
+	if (start < vma->vm_start)
+		start = vma->vm_start;
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	pgoff = folio_pgoff(folio);
+	/* if folio start address is not in vma range */
+	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
+		return false;
+
+	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	if (addr < start || end - addr < folio_size(folio))
+		return false;
+
+	/* not necessary to check pte for none large folio */
+	if (!folio_test_large(folio))
+		return true;
+
+	if (!pte)
+		return false;
+
+	/* check whether parameter pte is associated with folio */
+	ptent = ptep_get(pte);
+	if (pte_none(ptent) || !pte_present(ptent) ||
+			pte_pfn(ptent) - folio_pfn(folio) >= nr)
+		return false;
+
+	pte -= pte_pfn(ptent) - folio_pfn(folio);
+	for (i = 0; i < nr; i++, pte++) {
+		ptent = ptep_get(pte);
+
+		if (pte_none(ptent) || !pte_present(ptent) ||
+				pte_pfn(ptent) - folio_pfn(folio) >= nr)
+			return false;
+	}
+
+	return true;
+}
+
+static inline bool
+folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
+{
+	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
+}
+
 /*
  * mlock_vma_folio() and munlock_vma_folio():
  * should be called with vma's mmap_lock held for read or write,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range
  2023-07-28  7:09 [PATCH 0/3] support large folio for mlock Yin Fengwei
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
@ 2023-07-28  7:09 ` Yin Fengwei
  2023-07-28  7:09 ` [PATCH 3/3] mm: mlock: update mlock_pte_range to handle large folio Yin Fengwei
  2 siblings, 0 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-07-28  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, david, ryan.roberts,
	shy828301, hughd
  Cc: fengwei.yin

If large folio is in the range of VM_LOCKED VMA, it should be
mlocked to avoid being picked by page reclaim. Which may split
the large folio and then mlock each pages again.

Mlock this kind of large folio to prevent them being picked by
page reclaim.

For the large folio which cross the boundary of VM_LOCKED VMA,
we'd better not to mlock it. So if the system is under memory
pressure, this kind of large folio will be split and the pages
ouf of VM_LOCKED VMA can be reclaimed.

for page_add_anon_rmap() and page_add_file_rmap(), we only mlock
the folio if it's not large folio. The reason to do so is that
these functions can be called couple of times for a large folio
and each call just covered piece of large folio. If folio is
mlocked multiple time, the folio->mlock_count can be imbalance.
Delay the folio mlock to page reclaim phase. As only mlock folio
once for sure in page reclaim phase.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/internal.h | 18 +++++++++---------
 mm/rmap.c     | 27 ++++++++++++++++++++-------
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 63de32154a48..6c6fb1f3e4c1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,14 +662,10 @@ folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
  * mlock is usually called at the end of page_add_*_rmap(), munlock at
  * the end of page_remove_rmap(); but new anon folios are managed by
  * folio_add_lru_vma() calling mlock_new_folio().
- *
- * @compound is used to include pmd mappings of THPs, but filter out
- * pte mappings of THPs, which cannot be consistently counted: a pte
- * mapping of the THP head cannot be distinguished by the page alone.
  */
 void mlock_folio(struct folio *folio);
 static inline void mlock_vma_folio(struct folio *folio,
-			struct vm_area_struct *vma, bool compound)
+			struct vm_area_struct *vma, pte_t *pte)
 {
 	/*
 	 * The VM_SPECIAL check here serves two purposes.
@@ -680,16 +676,20 @@ static inline void mlock_vma_folio(struct folio *folio,
 	 *    still be set while VM_SPECIAL bits are added: so ignore it then.
 	 */
 	if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
-	    (compound || !folio_test_large(folio)))
+	    folio_within_vma(folio, vma, pte))
 		mlock_folio(folio);
 }
 
 void munlock_folio(struct folio *folio);
 static inline void munlock_vma_folio(struct folio *folio,
-			struct vm_area_struct *vma, bool compound)
+					struct vm_area_struct *vma)
 {
-	if (unlikely(vma->vm_flags & VM_LOCKED) &&
-	    (compound || !folio_test_large(folio)))
+	/*
+	 * To handle the case that a mlocked large folio is unmapped from VMA
+	 * piece by piece, allow munlock the large folio which is partially
+	 * mapped to VMA.
+	 */
+	if (unlikely(vma->vm_flags & VM_LOCKED))
 		munlock_folio(folio);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 54124f18e0e4..1d8f048fbed8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,7 @@ struct folio_referenced_arg {
 	unsigned long vm_flags;
 	struct mem_cgroup *memcg;
 };
+
 /*
  * arg: folio_referenced_arg will be passed
  */
@@ -811,10 +812,22 @@ static bool folio_referenced_one(struct folio *folio,
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if ((vma->vm_flags & VM_LOCKED) &&
-		    (!folio_test_large(folio) || !pvmw.pte)) {
+		if (vma->vm_flags & VM_LOCKED) {
+			if (!folio_within_vma(folio, vma, pvmw.pte)) {
+				/*
+				 * For large folio cross VMA boundaries, it's
+				 * expected to be picked  by page reclaim. But
+				 * should skip reference of pages which are in
+				 * the range of VM_LOCKED vma. As page reclaim
+				 * should just count the reference of pages out
+				 * the range of VM_LOCKED vma.
+				 */
+				pra->mapcount--;
+				continue;
+			}
+
 			/* Restore the mlock which got missed */
-			mlock_vma_folio(folio, vma, !pvmw.pte);
+			mlock_vma_folio(folio, vma, pvmw.pte);
 			page_vma_mapped_walk_done(&pvmw);
 			pra->vm_flags |= VM_LOCKED;
 			return false; /* To break the loop */
@@ -1253,7 +1266,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 			__page_check_anon_rmap(folio, page, vma, address);
 	}
 
-	mlock_vma_folio(folio, vma, compound);
+	mlock_vma_folio(folio, vma, NULL);
 }
 
 /**
@@ -1344,7 +1357,7 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	if (nr)
 		__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
 
-	mlock_vma_folio(folio, vma, compound);
+	mlock_vma_folio(folio, vma, NULL);
 }
 
 /**
@@ -1383,7 +1396,7 @@ static void __remove_rmap_finish(struct folio *folio,
 	 * it's only reliable while mapped.
 	 */
 
-	munlock_vma_folio(folio, vma, compound);
+	munlock_vma_folio(folio, vma);
 }
 
 /**
@@ -1557,7 +1570,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		if (!(flags & TTU_IGNORE_MLOCK) &&
 		    (vma->vm_flags & VM_LOCKED)) {
 			/* Restore the mlock which got missed */
-			mlock_vma_folio(folio, vma, false);
+			mlock_vma_folio(folio, vma, pvmw.pte);
 			page_vma_mapped_walk_done(&pvmw);
 			ret = false;
 			break;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] mm: mlock: update mlock_pte_range to handle large folio
  2023-07-28  7:09 [PATCH 0/3] support large folio for mlock Yin Fengwei
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
  2023-07-28  7:09 ` [PATCH 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range Yin Fengwei
@ 2023-07-28  7:09 ` Yin Fengwei
  2 siblings, 0 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-07-28  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, david, ryan.roberts,
	shy828301, hughd
  Cc: fengwei.yin

Current kernel only lock base size folio during mlock syscall.
Add large folio support with following rules:
  - Only mlock large folio when it's in VM_LOCKED VMA range
    and fully mapped to page table.

    fully mapped folio is required to handle the case that
    mremap happens in the middle of large folio and split
    pages of large folio to two different VMA.

  - munlock will apply to the large folio which is in VMA range
    or cross the VMA boundary.

    This is required to handle the case that the large folio is
    mlocked, later the VMA is split in the middle of large folio.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/mlock.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..8056f9176b70 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -305,6 +305,50 @@ void munlock_folio(struct folio *folio)
 	local_unlock(&mlock_fbatch.lock);
 }
 
+static inline unsigned int folio_mlock_step(struct folio *folio,
+		pte_t *pte, unsigned long addr, unsigned long end)
+{
+	unsigned int count, i, nr = folio_nr_pages(folio);
+	unsigned long pfn = folio_pfn(folio);
+	pte_t ptent = ptep_get(pte);
+
+	if (!folio_test_large(folio))
+		return 1;
+
+	count = pfn + nr - pte_pfn(ptent);
+	count = min_t(unsigned int, count, (end - addr) >> PAGE_SHIFT);
+
+	if (!pte)
+		return count;
+
+	for (i = 0; i < count; i++, pte++) {
+		pte_t entry = ptep_get(pte);
+
+		if (pte_none(entry) || !pte_present(entry))
+			break;
+		if (pte_pfn(entry) - pfn >= nr)
+			break;
+	}
+
+	return i;
+}
+
+static inline bool should_mlock_folio(struct folio *folio,
+				struct vm_area_struct *vma, pte_t *pte)
+{
+	/*
+	 * For unlock, allow munlock large folio which is partially
+	 * mapped to VMA. As it's possible that large folio is
+	 * mlocked and VMA is split later.
+	 *
+	 * During memory pressure, such kind of large folio can
+	 * be split. And the pages are not in VM_LOCKed VMA
+	 * can be reclaimed.
+	 */
+	return !(vma->vm_flags & VM_LOCKED) ||
+			folio_within_vma(folio, vma, pte);
+}
+
 static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 			   unsigned long end, struct mm_walk *walk)
 
@@ -314,6 +358,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *start_pte, *pte;
 	pte_t ptent;
 	struct folio *folio;
+	unsigned int step = 1;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -334,6 +379,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
+
 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = ptep_get(pte);
 		if (!pte_present(ptent))
@@ -341,12 +387,19 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 		folio = vm_normal_folio(vma, addr, ptent);
 		if (!folio || folio_is_zone_device(folio))
 			continue;
-		if (folio_test_large(folio))
-			continue;
+
+		step = folio_mlock_step(folio, pte, addr, end);
+		if (!should_mlock_folio(folio, vma, pte))
+			goto next_entry;
+
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_folio(folio);
 		else
 			munlock_folio(folio);
+
+next_entry:
+		pte += step - 1;
+		addr += (step - 1) << PAGE_SHIFT;
 	}
 	pte_unmap(start_pte);
 out:
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
@ 2023-07-28 18:34   ` Andrew Morton
  2023-07-29 13:54     ` Yin, Fengwei
  2023-08-02 11:14   ` Ryan Roberts
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2023-07-28 18:34 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: linux-mm, linux-kernel, yuzhao, willy, david, ryan.roberts,
	shy828301, hughd

On Fri, 28 Jul 2023 15:09:27 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
> 
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
> 
> ...
>
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>  				   bool write, int *locked);
>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>  			       unsigned long bytes);
> +
> +/*
> + * Check whether the folio is in specific range
> + *
> + * First, check whether the folio is in the range of vma.
> + * Then, check whether the folio is mapped to the range of [start, end].
> + * In the end, check whether the folio is fully mapped to the range.
> + *
> + * @pte page table pointer will be checked whether the large folio
> + *      is fully mapped to. Currently, if mremap in the middle of
> + *      large folio, the large folio could be mapped to to different
> + *      VMA and address check can't identify this situation.
> + */
> +static inline bool
> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end, pte_t *pte)
> +{

This looks crazily way too large to be inlined?

> +	pte_t ptent;
> +	unsigned long i, nr = folio_nr_pages(folio);
> +	pgoff_t pgoff, addr;
> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
> +
> +	if (start < vma->vm_start)
> +		start = vma->vm_start;
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;
> +
> +	pgoff = folio_pgoff(folio);
> +	/* if folio start address is not in vma range */
> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
> +		return false;
> +
> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	if (addr < start || end - addr < folio_size(folio))
> +		return false;
> +
> +	/* not necessary to check pte for none large folio */
> +	if (!folio_test_large(folio))
> +		return true;
> +
> +	if (!pte)
> +		return false;
> +
> +	/* check whether parameter pte is associated with folio */
> +	ptent = ptep_get(pte);
> +	if (pte_none(ptent) || !pte_present(ptent) ||
> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +		return false;
> +
> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
> +	for (i = 0; i < nr; i++, pte++) {
> +		ptent = ptep_get(pte);
> +
> +		if (pte_none(ptent) || !pte_present(ptent) ||
> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline bool
> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
> +{
> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
> +}

And this doesn't.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-07-28 18:34   ` Andrew Morton
@ 2023-07-29 13:54     ` Yin, Fengwei
  0 siblings, 0 replies; 26+ messages in thread
From: Yin, Fengwei @ 2023-07-29 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, yuzhao, willy, david, ryan.roberts,
	shy828301, hughd



On 7/29/2023 2:34 AM, Andrew Morton wrote:
> On Fri, 28 Jul 2023 15:09:27 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> 
>> It will be used to check whether the folio is mapped to specific
>> VMA and whether the mapping address of folio is in the range.
>>
>> Also a helper function folio_within_vma() to check whether folio
>> is in the range of vma based on folio_in_range().
>>
>> ...
>>
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>  				   bool write, int *locked);
>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>  			       unsigned long bytes);
>> +
>> +/*
>> + * Check whether the folio is in specific range
>> + *
>> + * First, check whether the folio is in the range of vma.
>> + * Then, check whether the folio is mapped to the range of [start, end].
>> + * In the end, check whether the folio is fully mapped to the range.
>> + *
>> + * @pte page table pointer will be checked whether the large folio
>> + *      is fully mapped to. Currently, if mremap in the middle of
>> + *      large folio, the large folio could be mapped to to different
>> + *      VMA and address check can't identify this situation.
>> + */
>> +static inline bool
>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>> +		unsigned long start, unsigned long end, pte_t *pte)
>> +{
> 
> This looks crazily way too large to be inlined?
Oops. Yes. You are right. I will make it not inline in next version. Thanks.


Regards
Yin, Fengwei

> 
>> +	pte_t ptent;
>> +	unsigned long i, nr = folio_nr_pages(folio);
>> +	pgoff_t pgoff, addr;
>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>> +
>> +	if (start < vma->vm_start)
>> +		start = vma->vm_start;
>> +	if (end > vma->vm_end)
>> +		end = vma->vm_end;
>> +
>> +	pgoff = folio_pgoff(folio);
>> +	/* if folio start address is not in vma range */
>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>> +		return false;
>> +
>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +	if (addr < start || end - addr < folio_size(folio))
>> +		return false;
>> +
>> +	/* not necessary to check pte for none large folio */
>> +	if (!folio_test_large(folio))
>> +		return true;
>> +
>> +	if (!pte)
>> +		return false;
>> +
>> +	/* check whether parameter pte is associated with folio */
>> +	ptent = ptep_get(pte);
>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +		return false;
>> +
>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>> +	for (i = 0; i < nr; i++, pte++) {
>> +		ptent = ptep_get(pte);
>> +
>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +			return false;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static inline bool
>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>> +{
>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>> +}
> 
> And this doesn't.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
  2023-07-28 18:34   ` Andrew Morton
@ 2023-08-02 11:14   ` Ryan Roberts
  2023-08-02 11:35     ` Ryan Roberts
  2023-08-02 12:50     ` Yin, Fengwei
  2023-08-02 15:15   ` Ryan Roberts
  2023-08-03  9:58   ` Ryan Roberts
  3 siblings, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 11:14 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 28/07/2023 08:09, Yin Fengwei wrote:
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
> 
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a03bc4782a2..63de32154a48 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>  				   bool write, int *locked);
>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>  			       unsigned long bytes);
> +
> +/*
> + * Check whether the folio is in specific range
> + *
> + * First, check whether the folio is in the range of vma.
> + * Then, check whether the folio is mapped to the range of [start, end].
> + * In the end, check whether the folio is fully mapped to the range.
> + *
> + * @pte page table pointer will be checked whether the large folio
> + *      is fully mapped to. Currently, if mremap in the middle of
> + *      large folio, the large folio could be mapped to to different
> + *      VMA and address check can't identify this situation.
> + */
> +static inline bool
> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end, pte_t *pte)

This api seems a bit redundant to me. Wouldn't it be better to remove the vma
parameter and instead fix up the start/end addresses in folio_within_vma()?

> +{
> +	pte_t ptent;
> +	unsigned long i, nr = folio_nr_pages(folio);
> +	pgoff_t pgoff, addr;
> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
> +
> +	if (start < vma->vm_start)
> +		start = vma->vm_start;
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;
> +
> +	pgoff = folio_pgoff(folio);
> +	/* if folio start address is not in vma range */
> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
> +		return false;
> +
> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	if (addr < start || end - addr < folio_size(folio))
> +		return false;
> +
> +	/* not necessary to check pte for none large folio */
> +	if (!folio_test_large(folio))
> +		return true;
> +
> +	if (!pte)
> +		return false;
> +
> +	/* check whether parameter pte is associated with folio */
> +	ptent = ptep_get(pte);
> +	if (pte_none(ptent) || !pte_present(ptent) ||
> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +		return false;
> +
> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
> +	for (i = 0; i < nr; i++, pte++) {
> +		ptent = ptep_get(pte);
> +
> +		if (pte_none(ptent) || !pte_present(ptent) ||
> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +			return false;
> +	}

I don't think I see anything to ensure you don't wander off the end (or start)
of the pgtable? If the folio is mremapped so that it straddles multiple tables
(or is bigger than a single table?) then I think pte can become invalid? Perhaps
you intended start/end to always be within the same pgtable, but that is not
guarranteed in the case that folio_within_vma() is making the call.

Also I want to check that this function is definitely always called under the
PTL for the table that pte belongs to?

> +
> +	return true;
> +}
> +
> +static inline bool
> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
> +{
> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
> +}
> +
>  /*
>   * mlock_vma_folio() and munlock_vma_folio():
>   * should be called with vma's mmap_lock held for read or write,


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 11:14   ` Ryan Roberts
@ 2023-08-02 11:35     ` Ryan Roberts
  2023-08-02 13:12       ` Yin, Fengwei
  2023-08-02 12:50     ` Yin, Fengwei
  1 sibling, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 11:35 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 02/08/2023 12:14, Ryan Roberts wrote:
> On 28/07/2023 08:09, Yin Fengwei wrote:
>> It will be used to check whether the folio is mapped to specific
>> VMA and whether the mapping address of folio is in the range.
>>
>> Also a helper function folio_within_vma() to check whether folio
>> is in the range of vma based on folio_in_range().
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 69 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 5a03bc4782a2..63de32154a48 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>  				   bool write, int *locked);
>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>  			       unsigned long bytes);
>> +
>> +/*
>> + * Check whether the folio is in specific range
>> + *
>> + * First, check whether the folio is in the range of vma.
>> + * Then, check whether the folio is mapped to the range of [start, end].
>> + * In the end, check whether the folio is fully mapped to the range.
>> + *
>> + * @pte page table pointer will be checked whether the large folio
>> + *      is fully mapped to. Currently, if mremap in the middle of
>> + *      large folio, the large folio could be mapped to to different
>> + *      VMA and address check can't identify this situation.
>> + */
>> +static inline bool
>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>> +		unsigned long start, unsigned long end, pte_t *pte)
> 
> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
> parameter and instead fix up the start/end addresses in folio_within_vma()?

I have created a function as part of my "pte batch-zap" patch set [1], which
counts the number of contiguously mapped pages of a folio
(folio_nr_pages_cont_mapped()). I wonder if actually this should be the
primitive, which can be shared for more cases. Then your folio_within_vma()
function could just compare the nr_pages to folio_nr_pages() to decide if the
folio is fully and contiguously mapped in the VMA.

I also wonder if you should change the name of folio_within_vma() to something
like folio_test_cont_in_vma() to disambiguate from the case where the folio may
be fully mapped with a discontiguity (although perhaps that's not possible
because a mremap would result in distinct vmas... would a new mmap in the hole
cause a merge of all 3?).

[1] https://lore.kernel.org/linux-mm/20230727141837.3386072-4-ryan.roberts@arm.com/

> 
>> +{
>> +	pte_t ptent;
>> +	unsigned long i, nr = folio_nr_pages(folio);
>> +	pgoff_t pgoff, addr;
>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>> +
>> +	if (start < vma->vm_start)
>> +		start = vma->vm_start;
>> +	if (end > vma->vm_end)
>> +		end = vma->vm_end;
>> +
>> +	pgoff = folio_pgoff(folio);
>> +	/* if folio start address is not in vma range */
>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>> +		return false;
>> +
>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +	if (addr < start || end - addr < folio_size(folio))
>> +		return false;
>> +
>> +	/* not necessary to check pte for none large folio */
>> +	if (!folio_test_large(folio))
>> +		return true;
>> +
>> +	if (!pte)
>> +		return false;
>> +
>> +	/* check whether parameter pte is associated with folio */
>> +	ptent = ptep_get(pte);
>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +		return false;
>> +
>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>> +	for (i = 0; i < nr; i++, pte++) {
>> +		ptent = ptep_get(pte);
>> +
>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +			return false;
>> +	}
> 
> I don't think I see anything to ensure you don't wander off the end (or start)
> of the pgtable? If the folio is mremapped so that it straddles multiple tables
> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
> you intended start/end to always be within the same pgtable, but that is not
> guarranteed in the case that folio_within_vma() is making the call.
> 
> Also I want to check that this function is definitely always called under the
> PTL for the table that pte belongs to?
> 
>> +
>> +	return true;
>> +}
>> +
>> +static inline bool
>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>> +{
>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>> +}
>> +
>>  /*
>>   * mlock_vma_folio() and munlock_vma_folio():
>>   * should be called with vma's mmap_lock held for read or write,
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 11:14   ` Ryan Roberts
  2023-08-02 11:35     ` Ryan Roberts
@ 2023-08-02 12:50     ` Yin, Fengwei
  2023-08-02 13:09       ` Ryan Roberts
  1 sibling, 1 reply; 26+ messages in thread
From: Yin, Fengwei @ 2023-08-02 12:50 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/2023 7:14 PM, Ryan Roberts wrote:
> On 28/07/2023 08:09, Yin Fengwei wrote:
>> It will be used to check whether the folio is mapped to specific
>> VMA and whether the mapping address of folio is in the range.
>>
>> Also a helper function folio_within_vma() to check whether folio
>> is in the range of vma based on folio_in_range().
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 69 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 5a03bc4782a2..63de32154a48 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>  				   bool write, int *locked);
>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>  			       unsigned long bytes);
>> +
>> +/*
>> + * Check whether the folio is in specific range
>> + *
>> + * First, check whether the folio is in the range of vma.
>> + * Then, check whether the folio is mapped to the range of [start, end].
>> + * In the end, check whether the folio is fully mapped to the range.
>> + *
>> + * @pte page table pointer will be checked whether the large folio
>> + *      is fully mapped to. Currently, if mremap in the middle of
>> + *      large folio, the large folio could be mapped to to different
>> + *      VMA and address check can't identify this situation.
>> + */
>> +static inline bool
>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>> +		unsigned long start, unsigned long end, pte_t *pte)
> 
> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
> parameter and instead fix up the start/end addresses in folio_within_vma()?
My understanding is it's necessary. As for madvise, we need to check whether
the folio is both in the range of VMA and also in the range of [start, end).

> 
>> +{
>> +	pte_t ptent;
>> +	unsigned long i, nr = folio_nr_pages(folio);
>> +	pgoff_t pgoff, addr;
>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>> +
>> +	if (start < vma->vm_start)
>> +		start = vma->vm_start;
>> +	if (end > vma->vm_end)
>> +		end = vma->vm_end;
>> +
>> +	pgoff = folio_pgoff(folio);
>> +	/* if folio start address is not in vma range */
>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>> +		return false;
>> +
>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +	if (addr < start || end - addr < folio_size(folio))
>> +		return false;
>> +
>> +	/* not necessary to check pte for none large folio */
>> +	if (!folio_test_large(folio))
>> +		return true;
>> +
>> +	if (!pte)
>> +		return false;
>> +
>> +	/* check whether parameter pte is associated with folio */
>> +	ptent = ptep_get(pte);
>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +		return false;
>> +
>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>> +	for (i = 0; i < nr; i++, pte++) {
>> +		ptent = ptep_get(pte);
>> +
>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +			return false;
>> +	}
> 
> I don't think I see anything to ensure you don't wander off the end (or start)
> of the pgtable? If the folio is mremapped so that it straddles multiple tables
> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
> you intended start/end to always be within the same pgtable, but that is not
> guarranteed in the case that folio_within_vma() is making the call.
If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
function just return false in page table entry check phase.

> 
> Also I want to check that this function is definitely always called under the
> PTL for the table that pte belongs to?
Yes. I should spell it out. Thanks.


Regards
Yin, Fengwei

> 
>> +
>> +	return true;
>> +}
>> +
>> +static inline bool
>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>> +{
>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>> +}
>> +
>>  /*
>>   * mlock_vma_folio() and munlock_vma_folio():
>>   * should be called with vma's mmap_lock held for read or write,
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 12:50     ` Yin, Fengwei
@ 2023-08-02 13:09       ` Ryan Roberts
  2023-08-02 13:46         ` Yin, Fengwei
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 13:09 UTC (permalink / raw)
  To: Yin, Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 02/08/2023 13:50, Yin, Fengwei wrote:
> 
> 
> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>> It will be used to check whether the folio is mapped to specific
>>> VMA and whether the mapping address of folio is in the range.
>>>
>>> Also a helper function folio_within_vma() to check whether folio
>>> is in the range of vma based on folio_in_range().
>>>
>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>> ---
>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 69 insertions(+)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 5a03bc4782a2..63de32154a48 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>  				   bool write, int *locked);
>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>  			       unsigned long bytes);
>>> +
>>> +/*
>>> + * Check whether the folio is in specific range
>>> + *
>>> + * First, check whether the folio is in the range of vma.
>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>> + * In the end, check whether the folio is fully mapped to the range.
>>> + *
>>> + * @pte page table pointer will be checked whether the large folio
>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>> + *      large folio, the large folio could be mapped to to different
>>> + *      VMA and address check can't identify this situation.
>>> + */
>>> +static inline bool
>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>
>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>> parameter and instead fix up the start/end addresses in folio_within_vma()?
> My understanding is it's necessary. As for madvise, we need to check whether
> the folio is both in the range of VMA and also in the range of [start, end).

But in folio_within_vma() you pass start as vma->vm_start and end as
vma->vm_end. And in this function, you narrow start/end to be completely
contained in vma. So surely there is only really one start/end you are
interested in? Just seems a bit odd to me.

> 
>>
>>> +{
>>> +	pte_t ptent;
>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>> +	pgoff_t pgoff, addr;
>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>> +
>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>> +
>>> +	if (start < vma->vm_start)
>>> +		start = vma->vm_start;
>>> +	if (end > vma->vm_end)
>>> +		end = vma->vm_end;
>>> +
>>> +	pgoff = folio_pgoff(folio);
>>> +	/* if folio start address is not in vma range */
>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>> +		return false;
>>> +
>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>> +	if (addr < start || end - addr < folio_size(folio))
>>> +		return false;
>>> +
>>> +	/* not necessary to check pte for none large folio */
>>> +	if (!folio_test_large(folio))
>>> +		return true;
>>> +
>>> +	if (!pte)
>>> +		return false;
>>> +
>>> +	/* check whether parameter pte is associated with folio */
>>> +	ptent = ptep_get(pte);
>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>> +		return false;
>>> +
>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>> +	for (i = 0; i < nr; i++, pte++) {
>>> +		ptent = ptep_get(pte);
>>> +
>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>> +			return false;
>>> +	}
>>
>> I don't think I see anything to ensure you don't wander off the end (or start)
>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>> you intended start/end to always be within the same pgtable, but that is not
>> guarranteed in the case that folio_within_vma() is making the call.
> If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
> function just return false in page table entry check phase.

Sorry I don't think this covers the issue I'm describing. If you have a
pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
problem?

example for 4K base page set up:

folio_nr_pages = 512
first page of folio mapped at vaddr = 2M - 4K = 0x1FF000

If you then call this function with the pte pointer for the second page in the
folio, which is mapped at address 0x200000, that pte is pointing to the first
pte entry in the table pointed to by the second pmd entry. The pte pointer can
be legitimately manipulated to point to any entry within that table,
corrsponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
from the pointer, intending that it now points to the pte entry that represents
vaddr 0x1FF000. But actually it has fallen off the front of the table into some
other arbitrary memory in the linear map. 0x1FF000 is represented in a different
table, pointed to by the first pmd entry.


> 
>>
>> Also I want to check that this function is definitely always called under the
>> PTL for the table that pte belongs to?
> Yes. I should spell it out. Thanks.
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static inline bool
>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>> +{
>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>> +}
>>> +
>>>  /*
>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>   * should be called with vma's mmap_lock held for read or write,
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 11:35     ` Ryan Roberts
@ 2023-08-02 13:12       ` Yin, Fengwei
  2023-08-02 15:12         ` Ryan Roberts
  0 siblings, 1 reply; 26+ messages in thread
From: Yin, Fengwei @ 2023-08-02 13:12 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/2023 7:35 PM, Ryan Roberts wrote:
> On 02/08/2023 12:14, Ryan Roberts wrote:
>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>> It will be used to check whether the folio is mapped to specific
>>> VMA and whether the mapping address of folio is in the range.
>>>
>>> Also a helper function folio_within_vma() to check whether folio
>>> is in the range of vma based on folio_in_range().
>>>
>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>> ---
>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 69 insertions(+)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 5a03bc4782a2..63de32154a48 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>  				   bool write, int *locked);
>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>  			       unsigned long bytes);
>>> +
>>> +/*
>>> + * Check whether the folio is in specific range
>>> + *
>>> + * First, check whether the folio is in the range of vma.
>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>> + * In the end, check whether the folio is fully mapped to the range.
>>> + *
>>> + * @pte page table pointer will be checked whether the large folio
>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>> + *      large folio, the large folio could be mapped to to different
>>> + *      VMA and address check can't identify this situation.
>>> + */
>>> +static inline bool
>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>
>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>> parameter and instead fix up the start/end addresses in folio_within_vma()?
> 
> I have created a function as part of my "pte batch-zap" patch set [1], which
> counts the number of contiguously mapped pages of a folio
> (folio_nr_pages_cont_mapped()). I wonder if actually this should be the
> primitive, which can be shared for more cases. Then your folio_within_vma()
> function could just compare the nr_pages to folio_nr_pages() to decide if the
> folio is fully and contiguously mapped in the VMA.
That means we need to unify the parameters. But I don't care about the page and
you don't care about the VMA. :). Maybe we can share the PTE check part?

> 
> I also wonder if you should change the name of folio_within_vma() to something
> like folio_test_cont_in_vma() to disambiguate from the case where the folio may
> be fully mapped with a discontiguity (although perhaps that's not possible
> because a mremap would result in distinct vmas... would a new mmap in the hole
> cause a merge of all 3?).
I don't think it's possible as mremap reuse original pgoff of VMA to new VMA. I suppose
it will prevent VMA merging. But I didn't check detail.

I hate to add the PTE check as it makes folio_within_vma() much heavy and can only
be called with page table holding. But MREMAP_DONTUNMAP could create the VMA which
just has part of folio mapped. And the first version folio_within_vma() can't identify
it.


Regards
Yin, Fengwei

> 
> [1] https://lore.kernel.org/linux-mm/20230727141837.3386072-4-ryan.roberts@arm.com/
> 
>>
>>> +{
>>> +	pte_t ptent;
>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>> +	pgoff_t pgoff, addr;
>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>> +
>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>> +
>>> +	if (start < vma->vm_start)
>>> +		start = vma->vm_start;
>>> +	if (end > vma->vm_end)
>>> +		end = vma->vm_end;
>>> +
>>> +	pgoff = folio_pgoff(folio);
>>> +	/* if folio start address is not in vma range */
>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>> +		return false;
>>> +
>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>> +	if (addr < start || end - addr < folio_size(folio))
>>> +		return false;
>>> +
>>> +	/* not necessary to check pte for none large folio */
>>> +	if (!folio_test_large(folio))
>>> +		return true;
>>> +
>>> +	if (!pte)
>>> +		return false;
>>> +
>>> +	/* check whether parameter pte is associated with folio */
>>> +	ptent = ptep_get(pte);
>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>> +		return false;
>>> +
>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>> +	for (i = 0; i < nr; i++, pte++) {
>>> +		ptent = ptep_get(pte);
>>> +
>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>> +			return false;
>>> +	}
>>
>> I don't think I see anything to ensure you don't wander off the end (or start)
>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>> you intended start/end to always be within the same pgtable, but that is not
>> guarranteed in the case that folio_within_vma() is making the call.
>>
>> Also I want to check that this function is definitely always called under the
>> PTL for the table that pte belongs to?
>>
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static inline bool
>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>> +{
>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>> +}
>>> +
>>>  /*
>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>   * should be called with vma's mmap_lock held for read or write,
>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 13:09       ` Ryan Roberts
@ 2023-08-02 13:46         ` Yin, Fengwei
  2023-08-02 14:08           ` Ryan Roberts
  0 siblings, 1 reply; 26+ messages in thread
From: Yin, Fengwei @ 2023-08-02 13:46 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/2023 9:09 PM, Ryan Roberts wrote:
> On 02/08/2023 13:50, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>> It will be used to check whether the folio is mapped to specific
>>>> VMA and whether the mapping address of folio is in the range.
>>>>
>>>> Also a helper function folio_within_vma() to check whether folio
>>>> is in the range of vma based on folio_in_range().
>>>>
>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>> ---
>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 69 insertions(+)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 5a03bc4782a2..63de32154a48 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>  				   bool write, int *locked);
>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>  			       unsigned long bytes);
>>>> +
>>>> +/*
>>>> + * Check whether the folio is in specific range
>>>> + *
>>>> + * First, check whether the folio is in the range of vma.
>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>> + *
>>>> + * @pte page table pointer will be checked whether the large folio
>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>> + *      large folio, the large folio could be mapped to to different
>>>> + *      VMA and address check can't identify this situation.
>>>> + */
>>>> +static inline bool
>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>
>>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>>> parameter and instead fix up the start/end addresses in folio_within_vma()?
>> My understanding is it's necessary. As for madvise, we need to check whether
>> the folio is both in the range of VMA and also in the range of [start, end).
> 
> But in folio_within_vma() you pass start as vma->vm_start and end as
> vma->vm_end. And in this function, you narrow start/end to be completely
> contained in vma. So surely there is only really one start/end you are
> interested in? Just seems a bit odd to me.
madvise() will call filio_in_range() with VMA and real range [start, end) passed
from user space.

> 
>>
>>>
>>>> +{
>>>> +	pte_t ptent;
>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>> +	pgoff_t pgoff, addr;
>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>> +
>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>> +
>>>> +	if (start < vma->vm_start)
>>>> +		start = vma->vm_start;
>>>> +	if (end > vma->vm_end)
>>>> +		end = vma->vm_end;
>>>> +
>>>> +	pgoff = folio_pgoff(folio);
>>>> +	/* if folio start address is not in vma range */
>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>> +		return false;
>>>> +
>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>> +		return false;
>>>> +
>>>> +	/* not necessary to check pte for none large folio */
>>>> +	if (!folio_test_large(folio))
>>>> +		return true;
>>>> +
>>>> +	if (!pte)
>>>> +		return false;
>>>> +
>>>> +	/* check whether parameter pte is associated with folio */
>>>> +	ptent = ptep_get(pte);
>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>> +		return false;
>>>> +
>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>> +		ptent = ptep_get(pte);
>>>> +
>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>> +			return false;
>>>> +	}
>>>
>>> I don't think I see anything to ensure you don't wander off the end (or start)
>>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>>> you intended start/end to always be within the same pgtable, but that is not
>>> guarranteed in the case that folio_within_vma() is making the call.
>> If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
>> function just return false in page table entry check phase.
> 
> Sorry I don't think this covers the issue I'm describing. If you have a
> pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
> problem?
> 
> example for 4K base page set up:
> 
> folio_nr_pages = 512
> first page of folio mapped at vaddr = 2M - 4K = 0x1FF000
> 
> If you then call this function with the pte pointer for the second page in the
> folio, which is mapped at address 0x200000, that pte is pointing to the first
> pte entry in the table pointed to by the second pmd entry. The pte pointer can
> be legitimately manipulated to point to any entry within that table,
> corrsponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
> from the pointer, intending that it now points to the pte entry that represents
> vaddr 0x1FF000. But actually it has fallen off the front of the table into some
> other arbitrary memory in the linear map. 0x1FF000 is represented in a different
> table, pointed to by the first pmd entry.
Yes. This can be an issue as hold the second page table lock can't prevent the first
part unmapped. Let me add another check vaddr align to folio_size in next version. 

Regards
Yin, Fengwei

> 
> 
>>
>>>
>>> Also I want to check that this function is definitely always called under the
>>> PTL for the table that pte belongs to?
>> Yes. I should spell it out. Thanks.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static inline bool
>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>> +{
>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>> +}
>>>> +
>>>>  /*
>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>   * should be called with vma's mmap_lock held for read or write,
>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 13:46         ` Yin, Fengwei
@ 2023-08-02 14:08           ` Ryan Roberts
  2023-08-02 14:14             ` Yin, Fengwei
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 14:08 UTC (permalink / raw)
  To: Yin, Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 02/08/2023 14:46, Yin, Fengwei wrote:
> 
> 
> On 8/2/2023 9:09 PM, Ryan Roberts wrote:
>> On 02/08/2023 13:50, Yin, Fengwei wrote:
>>>
>>>
>>> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>> It will be used to check whether the folio is mapped to specific
>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>
>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>> is in the range of vma based on folio_in_range().
>>>>>
>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>> ---
>>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 69 insertions(+)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>  				   bool write, int *locked);
>>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>  			       unsigned long bytes);
>>>>> +
>>>>> +/*
>>>>> + * Check whether the folio is in specific range
>>>>> + *
>>>>> + * First, check whether the folio is in the range of vma.
>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>> + *
>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>>> + *      large folio, the large folio could be mapped to to different
>>>>> + *      VMA and address check can't identify this situation.
>>>>> + */
>>>>> +static inline bool
>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>>
>>>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>>>> parameter and instead fix up the start/end addresses in folio_within_vma()?
>>> My understanding is it's necessary. As for madvise, we need to check whether
>>> the folio is both in the range of VMA and also in the range of [start, end).
>>
>> But in folio_within_vma() you pass start as vma->vm_start and end as
>> vma->vm_end. And in this function, you narrow start/end to be completely
>> contained in vma. So surely there is only really one start/end you are
>> interested in? Just seems a bit odd to me.
> madvise() will call filio_in_range() with VMA and real range [start, end) passed
> from user space.
> 
>>
>>>
>>>>
>>>>> +{
>>>>> +	pte_t ptent;
>>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>>> +	pgoff_t pgoff, addr;
>>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>> +
>>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>> +
>>>>> +	if (start < vma->vm_start)
>>>>> +		start = vma->vm_start;
>>>>> +	if (end > vma->vm_end)
>>>>> +		end = vma->vm_end;
>>>>> +
>>>>> +	pgoff = folio_pgoff(folio);
>>>>> +	/* if folio start address is not in vma range */
>>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>> +		return false;
>>>>> +
>>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>>> +		return false;
>>>>> +
>>>>> +	/* not necessary to check pte for none large folio */
>>>>> +	if (!folio_test_large(folio))
>>>>> +		return true;
>>>>> +
>>>>> +	if (!pte)
>>>>> +		return false;
>>>>> +
>>>>> +	/* check whether parameter pte is associated with folio */
>>>>> +	ptent = ptep_get(pte);
>>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>> +		return false;
>>>>> +
>>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>>> +		ptent = ptep_get(pte);
>>>>> +
>>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>> +			return false;
>>>>> +	}
>>>>
>>>> I don't think I see anything to ensure you don't wander off the end (or start)
>>>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>>>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>>>> you intended start/end to always be within the same pgtable, but that is not
>>>> guarranteed in the case that folio_within_vma() is making the call.
>>> If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
>>> function just return false in page table entry check phase.
>>
>> Sorry I don't think this covers the issue I'm describing. If you have a
>> pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
>> problem?
>>
>> example for 4K base page set up:
>>
>> folio_nr_pages = 512
>> first page of folio mapped at vaddr = 2M - 4K = 0x1FF000
>>
>> If you then call this function with the pte pointer for the second page in the
>> folio, which is mapped at address 0x200000, that pte is pointing to the first
>> pte entry in the table pointed to by the second pmd entry. The pte pointer can
>> be legitimately manipulated to point to any entry within that table,
>> corrsponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
>> from the pointer, intending that it now points to the pte entry that represents
>> vaddr 0x1FF000. But actually it has fallen off the front of the table into some
>> other arbitrary memory in the linear map. 0x1FF000 is represented in a different
>> table, pointed to by the first pmd entry.
> Yes. This can be an issue as hold the second page table lock can't prevent the first
> part unmapped. Let me add another check vaddr align to folio_size in next version. 

Locking is a problem but its not the only problem. The 2 tables are almost
certainly not contiguous in virtual memory. So once you have moved the pointer
to before the start of the second table, then you are pointing to arbitrary memory.

> 
> Regards
> Yin, Fengwei
> 
>>
>>
>>>
>>>>
>>>> Also I want to check that this function is definitely always called under the
>>>> PTL for the table that pte belongs to?
>>> Yes. I should spell it out. Thanks.
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>>> +
>>>>> +	return true;
>>>>> +}
>>>>> +
>>>>> +static inline bool
>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>> +{
>>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>>   * should be called with vma's mmap_lock held for read or write,
>>>>
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 14:08           ` Ryan Roberts
@ 2023-08-02 14:14             ` Yin, Fengwei
  2023-08-02 14:59               ` Ryan Roberts
  0 siblings, 1 reply; 26+ messages in thread
From: Yin, Fengwei @ 2023-08-02 14:14 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/2023 10:08 PM, Ryan Roberts wrote:
> On 02/08/2023 14:46, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 9:09 PM, Ryan Roberts wrote:
>>> On 02/08/2023 13:50, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>>> It will be used to check whether the folio is mapped to specific
>>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>>
>>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>>> is in the range of vma based on folio_in_range().
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>> ---
>>>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 69 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>>  				   bool write, int *locked);
>>>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>>  			       unsigned long bytes);
>>>>>> +
>>>>>> +/*
>>>>>> + * Check whether the folio is in specific range
>>>>>> + *
>>>>>> + * First, check whether the folio is in the range of vma.
>>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>>> + *
>>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>>>> + *      large folio, the large folio could be mapped to to different
>>>>>> + *      VMA and address check can't identify this situation.
>>>>>> + */
>>>>>> +static inline bool
>>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>>>
>>>>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>>>>> parameter and instead fix up the start/end addresses in folio_within_vma()?
>>>> My understanding is it's necessary. As for madvise, we need to check whether
>>>> the folio is both in the range of VMA and also in the range of [start, end).
>>>
>>> But in folio_within_vma() you pass start as vma->vm_start and end as
>>> vma->vm_end. And in this function, you narrow start/end to be completely
>>> contained in vma. So surely there is only really one start/end you are
>>> interested in? Just seems a bit odd to me.
>> madvise() will call filio_in_range() with VMA and real range [start, end) passed
>> from user space.
>>
>>>
>>>>
>>>>>
>>>>>> +{
>>>>>> +	pte_t ptent;
>>>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>>>> +	pgoff_t pgoff, addr;
>>>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>>> +
>>>>>> +	if (start < vma->vm_start)
>>>>>> +		start = vma->vm_start;
>>>>>> +	if (end > vma->vm_end)
>>>>>> +		end = vma->vm_end;
>>>>>> +
>>>>>> +	pgoff = folio_pgoff(folio);
>>>>>> +	/* if folio start address is not in vma range */
>>>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* not necessary to check pte for none large folio */
>>>>>> +	if (!folio_test_large(folio))
>>>>>> +		return true;
>>>>>> +
>>>>>> +	if (!pte)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* check whether parameter pte is associated with folio */
>>>>>> +	ptent = ptep_get(pte);
>>>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>>>> +		ptent = ptep_get(pte);
>>>>>> +
>>>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>> +			return false;
>>>>>> +	}
>>>>>
>>>>> I don't think I see anything to ensure you don't wander off the end (or start)
>>>>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>>>>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>>>>> you intended start/end to always be within the same pgtable, but that is not
>>>>> guarranteed in the case that folio_within_vma() is making the call.
>>>> If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
>>>> function just return false in page table entry check phase.
>>>
>>> Sorry I don't think this covers the issue I'm describing. If you have a
>>> pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
>>> problem?
>>>
>>> example for 4K base page set up:
>>>
>>> folio_nr_pages = 512
>>> first page of folio mapped at vaddr = 2M - 4K = 0x1FF000
>>>
>>> If you then call this function with the pte pointer for the second page in the
>>> folio, which is mapped at address 0x200000, that pte is pointing to the first
>>> pte entry in the table pointed to by the second pmd entry. The pte pointer can
>>> be legitimately manipulated to point to any entry within that table,
>>> corrsponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
>>> from the pointer, intending that it now points to the pte entry that represents
>>> vaddr 0x1FF000. But actually it has fallen off the front of the table into some
>>> other arbitrary memory in the linear map. 0x1FF000 is represented in a different
>>> table, pointed to by the first pmd entry.
>> Yes. This can be an issue as hold the second page table lock can't prevent the first
>> part unmapped. Let me add another check vaddr align to folio_size in next version. 
> 
> Locking is a problem but its not the only problem. The 2 tables are almost
> certainly not contiguous in virtual memory. So once you have moved the pointer
> to before the start of the second table, then you are pointing to arbitrary memory.
If vaddr is aligned to folio_size, suppose we are OK here (I have assumption that
large folio will not be larger than PMD size. Or it's possible on ARM platform?).


Regards
Yin, Fengwei

> 
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>
>>>>
>>>>>
>>>>> Also I want to check that this function is definitely always called under the
>>>>> PTL for the table that pte belongs to?
>>>> Yes. I should spell it out. Thanks.
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>>> +
>>>>>> +	return true;
>>>>>> +}
>>>>>> +
>>>>>> +static inline bool
>>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>>> +{
>>>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>>> +}
>>>>>> +
>>>>>>  /*
>>>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>>>   * should be called with vma's mmap_lock held for read or write,
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 14:14             ` Yin, Fengwei
@ 2023-08-02 14:59               ` Ryan Roberts
  2023-08-03  0:24                 ` Yin Fengwei
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 14:59 UTC (permalink / raw)
  To: Yin, Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 02/08/2023 15:14, Yin, Fengwei wrote:
> 
> 
> On 8/2/2023 10:08 PM, Ryan Roberts wrote:
>> On 02/08/2023 14:46, Yin, Fengwei wrote:
>>>
>>>
>>> On 8/2/2023 9:09 PM, Ryan Roberts wrote:
>>>> On 02/08/2023 13:50, Yin, Fengwei wrote:
>>>>>
>>>>>
>>>>> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>>>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>>>> It will be used to check whether the folio is mapped to specific
>>>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>>>
>>>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>>>> is in the range of vma based on folio_in_range().
>>>>>>>
>>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>>> ---
>>>>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>  1 file changed, 69 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>>>> --- a/mm/internal.h
>>>>>>> +++ b/mm/internal.h
>>>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>>>  				   bool write, int *locked);
>>>>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>>>  			       unsigned long bytes);
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Check whether the folio is in specific range
>>>>>>> + *
>>>>>>> + * First, check whether the folio is in the range of vma.
>>>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>>>> + *
>>>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>>>>> + *      large folio, the large folio could be mapped to to different
>>>>>>> + *      VMA and address check can't identify this situation.
>>>>>>> + */
>>>>>>> +static inline bool
>>>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>>>>
>>>>>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>>>>>> parameter and instead fix up the start/end addresses in folio_within_vma()?
>>>>> My understanding is it's necessary. As for madvise, we need to check whether
>>>>> the folio is both in the range of VMA and also in the range of [start, end).
>>>>
>>>> But in folio_within_vma() you pass start as vma->vm_start and end as
>>>> vma->vm_end. And in this function, you narrow start/end to be completely
>>>> contained in vma. So surely there is only really one start/end you are
>>>> interested in? Just seems a bit odd to me.
>>> madvise() will call filio_in_range() with VMA and real range [start, end) passed
>>> from user space.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> +{
>>>>>>> +	pte_t ptent;
>>>>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>>>>> +	pgoff_t pgoff, addr;
>>>>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>>>> +
>>>>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>>>> +
>>>>>>> +	if (start < vma->vm_start)
>>>>>>> +		start = vma->vm_start;
>>>>>>> +	if (end > vma->vm_end)
>>>>>>> +		end = vma->vm_end;
>>>>>>> +
>>>>>>> +	pgoff = folio_pgoff(folio);
>>>>>>> +	/* if folio start address is not in vma range */
>>>>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	/* not necessary to check pte for none large folio */
>>>>>>> +	if (!folio_test_large(folio))
>>>>>>> +		return true;
>>>>>>> +
>>>>>>> +	if (!pte)
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	/* check whether parameter pte is associated with folio */
>>>>>>> +	ptent = ptep_get(pte);
>>>>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>>>>> +		ptent = ptep_get(pte);
>>>>>>> +
>>>>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>>> +			return false;
>>>>>>> +	}
>>>>>>
>>>>>> I don't think I see anything to ensure you don't wander off the end (or start)
>>>>>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>>>>>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>>>>>> you intended start/end to always be within the same pgtable, but that is not
>>>>>> guarranteed in the case that folio_within_vma() is making the call.
>>>>> If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
>>>>> function just return false in page table entry check phase.
>>>>
>>>> Sorry I don't think this covers the issue I'm describing. If you have a
>>>> pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
>>>> problem?
>>>>
>>>> example for 4K base page set up:
>>>>
>>>> folio_nr_pages = 512
>>>> first page of folio mapped at vaddr = 2M - 4K = 0x1FF000
>>>>
>>>> If you then call this function with the pte pointer for the second page in the
>>>> folio, which is mapped at address 0x200000, that pte is pointing to the first
>>>> pte entry in the table pointed to by the second pmd entry. The pte pointer can
>>>> be legitimately manipulated to point to any entry within that table,
>>>> corrsponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
>>>> from the pointer, intending that it now points to the pte entry that represents
>>>> vaddr 0x1FF000. But actually it has fallen off the front of the table into some
>>>> other arbitrary memory in the linear map. 0x1FF000 is represented in a different
>>>> table, pointed to by the first pmd entry.
>>> Yes. This can be an issue as hold the second page table lock can't prevent the first
>>> part unmapped. Let me add another check vaddr align to folio_size in next version. 
>>
>> Locking is a problem but its not the only problem. The 2 tables are almost
>> certainly not contiguous in virtual memory. So once you have moved the pointer
>> to before the start of the second table, then you are pointing to arbitrary memory.
> If vaddr is aligned to folio_size, suppose we are OK here (I have assumption that
> large folio will not be larger than PMD size. Or it's possible on ARM platform?).

I *think* your assumption that a folio will never be bigger than PMD size is ok.
(I'm guessing page cache never allocates bigger folios than that?).

But its a bad assumption to assume folios are always mapped in a naturally
aligned manner. mremapping a thp will cause non-natural alignment. User space
requesting a file (that is in a large folio in pagecache) to be mapped to
arbitrary (page-aligned) address will do that.

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Also I want to check that this function is definitely always called under the
>>>>>> PTL for the table that pte belongs to?
>>>>> Yes. I should spell it out. Thanks.
>>>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>>>
>>>>>>> +
>>>>>>> +	return true;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline bool
>>>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>>>> +{
>>>>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>>>>   * should be called with vma's mmap_lock held for read or write,
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 13:12       ` Yin, Fengwei
@ 2023-08-02 15:12         ` Ryan Roberts
  2023-08-03  1:36           ` Yin Fengwei
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 15:12 UTC (permalink / raw)
  To: Yin, Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

>> I also wonder if you should change the name of folio_within_vma() to something
>> like folio_test_cont_in_vma() to disambiguate from the case where the folio may
>> be fully mapped with a discontiguity (although perhaps that's not possible
>> because a mremap would result in distinct vmas... would a new mmap in the hole
>> cause a merge of all 3?).
> I don't think it's possible as mremap reuse original pgoff of VMA to new VMA. I suppose
> it will prevent VMA merging. But I didn't check detail.

pgoff is not relevant for anon though, right?

FWIW, I wrote a test to check if merging is performed. Interestingly, v5.4 (on x86) *does* merge the VMAs in this case, but v6.5-rc3 (on arm64) *does not* merge the VMAs in this case.

I think you should assume it might be possible in some cases.


#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t pgsize = getpagesize();
	char *memarea;
	char *memlow;
	char *memmid;
	char *memhigh;
	int ret = 0;

	// Get a free vm area big enough for 5 pages.
	memarea = mmap(NULL, pgsize * 5,
			PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS,
			-1, 0);
	if (memarea == MAP_FAILED) {
		perror("mmap 1");
		exit(1);
	}

	// Map 2 pages one page into allocated area.
	memlow = mmap(memarea + pgsize, pgsize * 2,
			PROT_READ | PROT_WRITE,
			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
			-1, 0);
	if (memlow == MAP_FAILED) {
		perror("mmap 2");
		exit(1);
	}

	// Move the second allocated page one page higher.
	memhigh = mremap(memarea + pgsize * 2, pgsize, pgsize,
			MREMAP_FIXED | MREMAP_MAYMOVE,
			memarea + pgsize * 3);
	if (memhigh == MAP_FAILED) {
		perror("mremap");
		exit(1);
	}

	// We should now have:
	// | page 0 | page 1 | page 2 | page 3 | page 4 |
	// | NONE   | vma 1  | empty  | vma 2  | NONE   |
	printf("Check for 2 vmas with hole: pid=%d, memarea=%p, memlow=%p, memhigh=%p\n",
		getpid(), memarea, memlow, memhigh);
	getchar();

	// Now map a page in the empty space.
	memmid = mmap(memarea + pgsize * 2, pgsize,
			PROT_READ | PROT_WRITE,
			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
			-1, 0);
	if (memmid == MAP_FAILED) {
		perror("mmap 2");
		exit(1);
	}

	// We should now have:
	// | page 0 | page 1 | page 2 | page 3 | page 4 |
	// | NONE   |          vma 1           | NONE   |
	printf("Check for single merged vma: pid=%d, memarea=%p, memlow=%p, memmid=%p, memhigh=%p\n",
		getpid(), memarea, memlow, memmid, memhigh);
	getchar();

	return ret;
}



Output on v5.4:

Check for 2 vmas with hole: pid=171038, memarea=0x7fe6c34d9000, memlow=0x7fe6c34da000, memhigh=0x7fe6c34dc000
Check for single merged vma: pid=171038, memarea=0x7fe6c34d9000, memlow=0x7fe6c34da000, memmid=0x7fe6c34db000, memhigh=0x7fe6c34dc000

And maps output at the 2 check points:

(base) ryarob01@e125769:/data_nvme0n1/ryarob01/granule_perf$ cat /proc/171038/maps
55e55c258000-55e55c259000 r--p 00000000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c259000-55e55c25a000 r-xp 00001000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c25a000-55e55c25b000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c25b000-55e55c25c000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c25c000-55e55c25d000 rw-p 00003000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c403000-55e55c424000 rw-p 00000000 00:00 0                          [heap]
7fe6c32d2000-7fe6c32f4000 r--p 00000000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c32f4000-7fe6c346c000 r-xp 00022000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c346c000-7fe6c34ba000 r--p 0019a000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c34ba000-7fe6c34be000 r--p 001e7000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c34be000-7fe6c34c0000 rw-p 001eb000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c34c0000-7fe6c34c6000 rw-p 00000000 00:00 0 
7fe6c34d9000-7fe6c34da000 ---p 00000000 00:00 0 
7fe6c34da000-7fe6c34db000 rw-p 00000000 00:00 0 
7fe6c34dc000-7fe6c34dd000 rw-p 00000000 00:00 0 
7fe6c34dd000-7fe6c34de000 ---p 00000000 00:00 0 
7fe6c34de000-7fe6c34df000 r--p 00000000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c34df000-7fe6c3502000 r-xp 00001000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c3502000-7fe6c350a000 r--p 00024000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c350b000-7fe6c350c000 r--p 0002c000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c350c000-7fe6c350d000 rw-p 0002d000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c350d000-7fe6c350e000 rw-p 00000000 00:00 0 
7fff39a11000-7fff39a32000 rw-p 00000000 00:00 0                          [stack]
7fff39a83000-7fff39a86000 r--p 00000000 00:00 0                          [vvar]
7fff39a86000-7fff39a87000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
(base) ryarob01@e125769:/data_nvme0n1/ryarob01/granule_perf$ cat /proc/171038/maps
55e55c258000-55e55c259000 r--p 00000000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c259000-55e55c25a000 r-xp 00001000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c25a000-55e55c25b000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c25b000-55e55c25c000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c25c000-55e55c25d000 rw-p 00003000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
55e55c403000-55e55c424000 rw-p 00000000 00:00 0                          [heap]
7fe6c32d2000-7fe6c32f4000 r--p 00000000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c32f4000-7fe6c346c000 r-xp 00022000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c346c000-7fe6c34ba000 r--p 0019a000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c34ba000-7fe6c34be000 r--p 001e7000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c34be000-7fe6c34c0000 rw-p 001eb000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
7fe6c34c0000-7fe6c34c6000 rw-p 00000000 00:00 0 
7fe6c34d9000-7fe6c34da000 ---p 00000000 00:00 0 
7fe6c34da000-7fe6c34dd000 rw-p 00000000 00:00 0 
7fe6c34dd000-7fe6c34de000 ---p 00000000 00:00 0 
7fe6c34de000-7fe6c34df000 r--p 00000000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c34df000-7fe6c3502000 r-xp 00001000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c3502000-7fe6c350a000 r--p 00024000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c350b000-7fe6c350c000 r--p 0002c000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c350c000-7fe6c350d000 rw-p 0002d000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
7fe6c350d000-7fe6c350e000 rw-p 00000000 00:00 0 
7fff39a11000-7fff39a32000 rw-p 00000000 00:00 0                          [stack]
7fff39a83000-7fff39a86000 r--p 00000000 00:00 0                          [vvar]
7fff39a86000-7fff39a87000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]


Output on v6.5-rc3:

Check for 2 vmas with hole: pid=3181, memarea=0xfffff7ff2000, memlow=0xfffff7ff3000, memhigh=0xfffff7ff5000
Check for single merged vma: pid=3181, memarea=0xfffff7ff2000, memlow=0xfffff7ff3000, memmid=0xfffff7ff4000, memhigh=0xfffff7ff5000

And maps output at the 2 check points:

ubuntu@ubuntuvm:~/linux$ cat /proc/3181/maps 
aaaaaaaa0000-aaaaaaaa1000 r-xp 00000000 fe:02 8199010                    /home/ubuntu/merge
aaaaaaab0000-aaaaaaab1000 r--p 00000000 fe:02 8199010                    /home/ubuntu/merge
aaaaaaab1000-aaaaaaab2000 rw-p 00001000 fe:02 8199010                    /home/ubuntu/merge
aaaaaaab2000-aaaaaaad3000 rw-p 00000000 00:00 0                          [heap]
fffff7e00000-fffff7f89000 r-xp 00000000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f89000-fffff7f98000 ---p 00189000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f98000-fffff7f9c000 r--p 00188000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f9c000-fffff7f9e000 rw-p 0018c000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f9e000-fffff7faa000 rw-p 00000000 00:00 0 
fffff7fc2000-fffff7fed000 r-xp 00000000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffff7ff2000-fffff7ff3000 ---p 00000000 00:00 0 
fffff7ff3000-fffff7ff4000 rw-p 00000000 00:00 0 
fffff7ff5000-fffff7ff6000 rw-p 00000000 00:00 0 
fffff7ff6000-fffff7ff7000 ---p 00000000 00:00 0 
fffff7ff7000-fffff7ff9000 rw-p 00000000 00:00 0 
fffff7ff9000-fffff7ffb000 r--p 00000000 00:00 0                          [vvar]
fffff7ffb000-fffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
fffff7ffc000-fffff7ffe000 r--p 0002a000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffff7ffe000-fffff8000000 rw-p 0002c000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffffffdf000-1000000000000 rw-p 00000000 00:00 0                         [stack]
ubuntu@ubuntuvm:~/linux$ cat /proc/3181/maps 
aaaaaaaa0000-aaaaaaaa1000 r-xp 00000000 fe:02 8199010                    /home/ubuntu/merge
aaaaaaab0000-aaaaaaab1000 r--p 00000000 fe:02 8199010                    /home/ubuntu/merge
aaaaaaab1000-aaaaaaab2000 rw-p 00001000 fe:02 8199010                    /home/ubuntu/merge
aaaaaaab2000-aaaaaaad3000 rw-p 00000000 00:00 0                          [heap]
fffff7e00000-fffff7f89000 r-xp 00000000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f89000-fffff7f98000 ---p 00189000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f98000-fffff7f9c000 r--p 00188000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f9c000-fffff7f9e000 rw-p 0018c000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
fffff7f9e000-fffff7faa000 rw-p 00000000 00:00 0 
fffff7fc2000-fffff7fed000 r-xp 00000000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffff7ff2000-fffff7ff3000 ---p 00000000 00:00 0 
fffff7ff3000-fffff7ff4000 rw-p 00000000 00:00 0 
fffff7ff4000-fffff7ff5000 rw-p 00000000 00:00 0 
fffff7ff5000-fffff7ff6000 rw-p 00000000 00:00 0 
fffff7ff6000-fffff7ff7000 ---p 00000000 00:00 0 
fffff7ff7000-fffff7ff9000 rw-p 00000000 00:00 0 
fffff7ff9000-fffff7ffb000 r--p 00000000 00:00 0                          [vvar]
fffff7ffb000-fffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
fffff7ffc000-fffff7ffe000 r--p 0002a000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffff7ffe000-fffff8000000 rw-p 0002c000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffffffdf000-1000000000000 rw-p 00000000 00:00 0                         [stack]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
  2023-07-28 18:34   ` Andrew Morton
  2023-08-02 11:14   ` Ryan Roberts
@ 2023-08-02 15:15   ` Ryan Roberts
  2023-08-03  0:41     ` Yin Fengwei
  2023-08-03  9:58   ` Ryan Roberts
  3 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-02 15:15 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 28/07/2023 08:09, Yin Fengwei wrote:
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
> 
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a03bc4782a2..63de32154a48 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>  				   bool write, int *locked);
>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>  			       unsigned long bytes);
> +
> +/*
> + * Check whether the folio is in specific range
> + *
> + * First, check whether the folio is in the range of vma.
> + * Then, check whether the folio is mapped to the range of [start, end].
> + * In the end, check whether the folio is fully mapped to the range.
> + *
> + * @pte page table pointer will be checked whether the large folio
> + *      is fully mapped to. Currently, if mremap in the middle of
> + *      large folio, the large folio could be mapped to to different
> + *      VMA and address check can't identify this situation.
> + */
> +static inline bool
> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end, pte_t *pte)
> +{
> +	pte_t ptent;
> +	unsigned long i, nr = folio_nr_pages(folio);
> +	pgoff_t pgoff, addr;
> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
> +
> +	if (start < vma->vm_start)
> +		start = vma->vm_start;
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;
> +
> +	pgoff = folio_pgoff(folio);
> +	/* if folio start address is not in vma range */
> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
> +		return false;

I'm struggling with this logic. What happens for an anonymous folio in a
(private) file mapping? Wouldn't the folio's pgoff be 0? In this case you could
return false incorrectly?

> +
> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	if (addr < start || end - addr < folio_size(folio))
> +		return false;
> +
> +	/* not necessary to check pte for none large folio */
> +	if (!folio_test_large(folio))
> +		return true;
> +
> +	if (!pte)
> +		return false;
> +
> +	/* check whether parameter pte is associated with folio */
> +	ptent = ptep_get(pte);
> +	if (pte_none(ptent) || !pte_present(ptent) ||
> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +		return false;
> +
> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
> +	for (i = 0; i < nr; i++, pte++) {
> +		ptent = ptep_get(pte);
> +
> +		if (pte_none(ptent) || !pte_present(ptent) ||
> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline bool
> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
> +{
> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
> +}
> +
>  /*
>   * mlock_vma_folio() and munlock_vma_folio():
>   * should be called with vma's mmap_lock held for read or write,


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 14:59               ` Ryan Roberts
@ 2023-08-03  0:24                 ` Yin Fengwei
  0 siblings, 0 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-08-03  0:24 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/23 22:59, Ryan Roberts wrote:
> On 02/08/2023 15:14, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 10:08 PM, Ryan Roberts wrote:
>>> On 02/08/2023 14:46, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 8/2/2023 9:09 PM, Ryan Roberts wrote:
>>>>> On 02/08/2023 13:50, Yin, Fengwei wrote:
>>>>>>
>>>>>>
>>>>>> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>>>>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>>>>> It will be used to check whether the folio is mapped to specific
>>>>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>>>>
>>>>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>>>>> is in the range of vma based on folio_in_range().
>>>>>>>>
>>>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>>>> ---
>>>>>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 69 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>>>>> --- a/mm/internal.h
>>>>>>>> +++ b/mm/internal.h
>>>>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>>>>  				   bool write, int *locked);
>>>>>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>>>>  			       unsigned long bytes);
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * Check whether the folio is in specific range
>>>>>>>> + *
>>>>>>>> + * First, check whether the folio is in the range of vma.
>>>>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>>>>> + *
>>>>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>>>>>> + *      large folio, the large folio could be mapped to to different
>>>>>>>> + *      VMA and address check can't identify this situation.
>>>>>>>> + */
>>>>>>>> +static inline bool
>>>>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>>>>>
>>>>>>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>>>>>>> parameter and instead fix up the start/end addresses in folio_within_vma()?
>>>>>> My understanding is it's necessary. As for madvise, we need to check whether
>>>>>> the folio is both in the range of VMA and also in the range of [start, end).
>>>>>
>>>>> But in folio_within_vma() you pass start as vma->vm_start and end as
>>>>> vma->vm_end. And in this function, you narrow start/end to be completely
>>>>> contained in vma. So surely there is only really one start/end you are
>>>>> interested in? Just seems a bit odd to me.
>>>> madvise() will call filio_in_range() with VMA and real range [start, end) passed
>>>> from user space.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +{
>>>>>>>> +	pte_t ptent;
>>>>>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>>>>>> +	pgoff_t pgoff, addr;
>>>>>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>>>>> +
>>>>>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>>>>> +
>>>>>>>> +	if (start < vma->vm_start)
>>>>>>>> +		start = vma->vm_start;
>>>>>>>> +	if (end > vma->vm_end)
>>>>>>>> +		end = vma->vm_end;
>>>>>>>> +
>>>>>>>> +	pgoff = folio_pgoff(folio);
>>>>>>>> +	/* if folio start address is not in vma range */
>>>>>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>>>>> +		return false;
>>>>>>>> +
>>>>>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>>>>>> +		return false;
>>>>>>>> +
>>>>>>>> +	/* not necessary to check pte for none large folio */
>>>>>>>> +	if (!folio_test_large(folio))
>>>>>>>> +		return true;
>>>>>>>> +
>>>>>>>> +	if (!pte)
>>>>>>>> +		return false;
>>>>>>>> +
>>>>>>>> +	/* check whether parameter pte is associated with folio */
>>>>>>>> +	ptent = ptep_get(pte);
>>>>>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>>>> +		return false;
>>>>>>>> +
>>>>>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>>>>>> +		ptent = ptep_get(pte);
>>>>>>>> +
>>>>>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>>>> +			return false;
>>>>>>>> +	}
>>>>>>>
>>>>>>> I don't think I see anything to ensure you don't wander off the end (or start)
>>>>>>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>>>>>>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>>>>>>> you intended start/end to always be within the same pgtable, but that is not
>>>>>>> guarranteed in the case that folio_within_vma() is making the call.
>>>>>> If pte is invalid for any reason (pass wrong parameter, not fully mapped etc), this
>>>>>> function just return false in page table entry check phase.
>>>>>
>>>>> Sorry I don't think this covers the issue I'm describing. If you have a
>>>>> pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
>>>>> problem?
>>>>>
>>>>> example for 4K base page set up:
>>>>>
>>>>> folio_nr_pages = 512
>>>>> first page of folio mapped at vaddr = 2M - 4K = 0x1FF000
>>>>>
>>>>> If you then call this function with the pte pointer for the second page in the
>>>>> folio, which is mapped at address 0x200000, that pte is pointing to the first
>>>>> pte entry in the table pointed to by the second pmd entry. The pte pointer can
>>>>> be legitimately manipulated to point to any entry within that table,
>>>>> corrsponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
>>>>> from the pointer, intending that it now points to the pte entry that represents
>>>>> vaddr 0x1FF000. But actually it has fallen off the front of the table into some
>>>>> other arbitrary memory in the linear map. 0x1FF000 is represented in a different
>>>>> table, pointed to by the first pmd entry.
>>>> Yes. This can be an issue as hold the second page table lock can't prevent the first
>>>> part unmapped. Let me add another check vaddr align to folio_size in next version. 
>>>
>>> Locking is a problem but its not the only problem. The 2 tables are almost
>>> certainly not contiguous in virtual memory. So once you have moved the pointer
>>> to before the start of the second table, then you are pointing to arbitrary memory.
>> If vaddr is aligned to folio_size, suppose we are OK here (I have assumption that
>> large folio will not be larger than PMD size. Or it's possible on ARM platform?).
> 
> I *think* your assumption that a folio will never be bigger than PMD size is ok.
> (I'm guessing page cache never allocates bigger folios than that?).> 
> But its a bad assumption to assume folios are always mapped in a naturally
> aligned manner. mremapping a thp will cause non-natural alignment. User space
> requesting a file (that is in a large folio in pagecache) to be mapped to
> arbitrary (page-aligned) address will do that.
The none-aligned THP is what I want to filter out here. But I totally forget the
page cache which could cause same thing as mremap. I will use another way to check it.
Thanks.


Regards
Yin, Fengwei

> 
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Also I want to check that this function is definitely always called under the
>>>>>>> PTL for the table that pte belongs to?
>>>>>> Yes. I should spell it out. Thanks.
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +	return true;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline bool
>>>>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>>>>> +{
>>>>>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>>>>>   * should be called with vma's mmap_lock held for read or write,
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 15:15   ` Ryan Roberts
@ 2023-08-03  0:41     ` Yin Fengwei
  0 siblings, 0 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-08-03  0:41 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/23 23:15, Ryan Roberts wrote:
> On 28/07/2023 08:09, Yin Fengwei wrote:
>> It will be used to check whether the folio is mapped to specific
>> VMA and whether the mapping address of folio is in the range.
>>
>> Also a helper function folio_within_vma() to check whether folio
>> is in the range of vma based on folio_in_range().
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 69 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 5a03bc4782a2..63de32154a48 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>  				   bool write, int *locked);
>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>  			       unsigned long bytes);
>> +
>> +/*
>> + * Check whether the folio is in specific range
>> + *
>> + * First, check whether the folio is in the range of vma.
>> + * Then, check whether the folio is mapped to the range of [start, end].
>> + * In the end, check whether the folio is fully mapped to the range.
>> + *
>> + * @pte page table pointer will be checked whether the large folio
>> + *      is fully mapped to. Currently, if mremap in the middle of
>> + *      large folio, the large folio could be mapped to to different
>> + *      VMA and address check can't identify this situation.
>> + */
>> +static inline bool
>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>> +		unsigned long start, unsigned long end, pte_t *pte)
>> +{
>> +	pte_t ptent;
>> +	unsigned long i, nr = folio_nr_pages(folio);
>> +	pgoff_t pgoff, addr;
>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>> +
>> +	if (start < vma->vm_start)
>> +		start = vma->vm_start;
>> +	if (end > vma->vm_end)
>> +		end = vma->vm_end;
>> +
>> +	pgoff = folio_pgoff(folio);
>> +	/* if folio start address is not in vma range */
>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>> +		return false;
> 
> I'm struggling with this logic. What happens for an anonymous folio in a
> (private) file mapping? Wouldn't the folio's pgoff be 0? In this case you could
> return false incorrectly?
You mean cow page for a file mapping?
I suppose the pgoff of folio (folio->index) is not 0 unless the folio is mapped
to vaddr 0. __page_set_anon_rmap() will set the folio->index based on vma->pgoff.


Regards
Yin, Fengwei

> 
>> +
>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +	if (addr < start || end - addr < folio_size(folio))
>> +		return false;
>> +
>> +	/* not necessary to check pte for none large folio */
>> +	if (!folio_test_large(folio))
>> +		return true;
>> +
>> +	if (!pte)
>> +		return false;
>> +
>> +	/* check whether parameter pte is associated with folio */
>> +	ptent = ptep_get(pte);
>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +		return false;
>> +
>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>> +	for (i = 0; i < nr; i++, pte++) {
>> +		ptent = ptep_get(pte);
>> +
>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +			return false;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static inline bool
>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>> +{
>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>> +}
>> +
>>  /*
>>   * mlock_vma_folio() and munlock_vma_folio():
>>   * should be called with vma's mmap_lock held for read or write,
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-02 15:12         ` Ryan Roberts
@ 2023-08-03  1:36           ` Yin Fengwei
  0 siblings, 0 replies; 26+ messages in thread
From: Yin Fengwei @ 2023-08-03  1:36 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/2/23 23:12, Ryan Roberts wrote:
>>> I also wonder if you should change the name of folio_within_vma() to something
>>> like folio_test_cont_in_vma() to disambiguate from the case where the folio may
>>> be fully mapped with a discontiguity (although perhaps that's not possible
>>> because a mremap would result in distinct vmas... would a new mmap in the hole
>>> cause a merge of all 3?).
>> I don't think it's possible as mremap reuse original pgoff of VMA to new VMA. I suppose
>> it will prevent VMA merging. But I didn't check detail.
> 
> pgoff is not relevant for anon though, right?
anonymous folio->index is related with vma->vm_pgoff.

> 
> FWIW, I wrote a test to check if merging is performed. Interestingly, v5.4 (on x86) *does* merge the VMAs in this case, but v6.5-rc3 (on arm64) *does not* merge the VMAs in this case.
What if you fault in the pages to the mapped VMAs?

> 
> I think you should assume it might be possible in some cases.
I don't think mremap target VMA could be merged with prev/next
unless it's merged back (move the pieces large folio together).


Regards
Yin, Fengwei

> 
> 
> #define _GNU_SOURCE
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> int main(int argc, char **argv)
> {
> 	size_t pgsize = getpagesize();
> 	char *memarea;
> 	char *memlow;
> 	char *memmid;
> 	char *memhigh;
> 	int ret = 0;
> 
> 	// Get a free vm area big enough for 5 pages.
> 	memarea = mmap(NULL, pgsize * 5,
> 			PROT_NONE,
> 			MAP_PRIVATE | MAP_ANONYMOUS,
> 			-1, 0);
> 	if (memarea == MAP_FAILED) {
> 		perror("mmap 1");
> 		exit(1);
> 	}
> 
> 	// Map 2 pages one page into allocated area.
> 	memlow = mmap(memarea + pgsize, pgsize * 2,
> 			PROT_READ | PROT_WRITE,
> 			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
> 			-1, 0);
> 	if (memlow == MAP_FAILED) {
> 		perror("mmap 2");
> 		exit(1);
> 	}
> 
> 	// Move the second allocated page one page higher.
> 	memhigh = mremap(memarea + pgsize * 2, pgsize, pgsize,
> 			MREMAP_FIXED | MREMAP_MAYMOVE,
> 			memarea + pgsize * 3);
> 	if (memhigh == MAP_FAILED) {
> 		perror("mremap");
> 		exit(1);
> 	}
> 
> 	// We should now have:
> 	// | page 0 | page 1 | page 2 | page 3 | page 4 |
> 	// | NONE   | vma 1  | empty  | vma 2  | NONE   |
> 	printf("Check for 2 vmas with hole: pid=%d, memarea=%p, memlow=%p, memhigh=%p\n",
> 		getpid(), memarea, memlow, memhigh);
> 	getchar();
> 
> 	// Now map a page in the empty space.
> 	memmid = mmap(memarea + pgsize * 2, pgsize,
> 			PROT_READ | PROT_WRITE,
> 			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
> 			-1, 0);
> 	if (memmid == MAP_FAILED) {
> 		perror("mmap 2");
> 		exit(1);
> 	}
> 
> 	// We should now have:
> 	// | page 0 | page 1 | page 2 | page 3 | page 4 |
> 	// | NONE   |          vma 1           | NONE   |
> 	printf("Check for single merged vma: pid=%d, memarea=%p, memlow=%p, memmid=%p, memhigh=%p\n",
> 		getpid(), memarea, memlow, memmid, memhigh);
> 	getchar();
> 
> 	return ret;
> }
> 
> 
> 
> Output on v5.4:
> 
> Check for 2 vmas with hole: pid=171038, memarea=0x7fe6c34d9000, memlow=0x7fe6c34da000, memhigh=0x7fe6c34dc000
> Check for single merged vma: pid=171038, memarea=0x7fe6c34d9000, memlow=0x7fe6c34da000, memmid=0x7fe6c34db000, memhigh=0x7fe6c34dc000
> 
> And maps output at the 2 check points:
> 
> (base) ryarob01@e125769:/data_nvme0n1/ryarob01/granule_perf$ cat /proc/171038/maps
> 55e55c258000-55e55c259000 r--p 00000000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c259000-55e55c25a000 r-xp 00001000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c25a000-55e55c25b000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c25b000-55e55c25c000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c25c000-55e55c25d000 rw-p 00003000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c403000-55e55c424000 rw-p 00000000 00:00 0                          [heap]
> 7fe6c32d2000-7fe6c32f4000 r--p 00000000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c32f4000-7fe6c346c000 r-xp 00022000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c346c000-7fe6c34ba000 r--p 0019a000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c34ba000-7fe6c34be000 r--p 001e7000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c34be000-7fe6c34c0000 rw-p 001eb000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c34c0000-7fe6c34c6000 rw-p 00000000 00:00 0 
> 7fe6c34d9000-7fe6c34da000 ---p 00000000 00:00 0 
> 7fe6c34da000-7fe6c34db000 rw-p 00000000 00:00 0 
> 7fe6c34dc000-7fe6c34dd000 rw-p 00000000 00:00 0 
> 7fe6c34dd000-7fe6c34de000 ---p 00000000 00:00 0 
> 7fe6c34de000-7fe6c34df000 r--p 00000000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c34df000-7fe6c3502000 r-xp 00001000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c3502000-7fe6c350a000 r--p 00024000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c350b000-7fe6c350c000 r--p 0002c000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c350c000-7fe6c350d000 rw-p 0002d000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c350d000-7fe6c350e000 rw-p 00000000 00:00 0 
> 7fff39a11000-7fff39a32000 rw-p 00000000 00:00 0                          [stack]
> 7fff39a83000-7fff39a86000 r--p 00000000 00:00 0                          [vvar]
> 7fff39a86000-7fff39a87000 r-xp 00000000 00:00 0                          [vdso]
> ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
> (base) ryarob01@e125769:/data_nvme0n1/ryarob01/granule_perf$ cat /proc/171038/maps
> 55e55c258000-55e55c259000 r--p 00000000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c259000-55e55c25a000 r-xp 00001000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c25a000-55e55c25b000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c25b000-55e55c25c000 r--p 00002000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c25c000-55e55c25d000 rw-p 00003000 fd:00 5297466                    /data_nvme0n1/ryarob01/granule_perf/merge
> 55e55c403000-55e55c424000 rw-p 00000000 00:00 0                          [heap]
> 7fe6c32d2000-7fe6c32f4000 r--p 00000000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c32f4000-7fe6c346c000 r-xp 00022000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c346c000-7fe6c34ba000 r--p 0019a000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c34ba000-7fe6c34be000 r--p 001e7000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c34be000-7fe6c34c0000 rw-p 001eb000 fd:02 9573653                    /lib/x86_64-linux-gnu/libc-2.31.so
> 7fe6c34c0000-7fe6c34c6000 rw-p 00000000 00:00 0 
> 7fe6c34d9000-7fe6c34da000 ---p 00000000 00:00 0 
> 7fe6c34da000-7fe6c34dd000 rw-p 00000000 00:00 0 
> 7fe6c34dd000-7fe6c34de000 ---p 00000000 00:00 0 
> 7fe6c34de000-7fe6c34df000 r--p 00000000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c34df000-7fe6c3502000 r-xp 00001000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c3502000-7fe6c350a000 r--p 00024000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c350b000-7fe6c350c000 r--p 0002c000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c350c000-7fe6c350d000 rw-p 0002d000 fd:02 9573649                    /lib/x86_64-linux-gnu/ld-2.31.so
> 7fe6c350d000-7fe6c350e000 rw-p 00000000 00:00 0 
> 7fff39a11000-7fff39a32000 rw-p 00000000 00:00 0                          [stack]
> 7fff39a83000-7fff39a86000 r--p 00000000 00:00 0                          [vvar]
> 7fff39a86000-7fff39a87000 r-xp 00000000 00:00 0                          [vdso]
> ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
> 
> 
> Output on v6.5-rc3:
> 
> Check for 2 vmas with hole: pid=3181, memarea=0xfffff7ff2000, memlow=0xfffff7ff3000, memhigh=0xfffff7ff5000
> Check for single merged vma: pid=3181, memarea=0xfffff7ff2000, memlow=0xfffff7ff3000, memmid=0xfffff7ff4000, memhigh=0xfffff7ff5000
> 
> And maps output at the 2 check points:
> 
> ubuntu@ubuntuvm:~/linux$ cat /proc/3181/maps 
> aaaaaaaa0000-aaaaaaaa1000 r-xp 00000000 fe:02 8199010                    /home/ubuntu/merge
> aaaaaaab0000-aaaaaaab1000 r--p 00000000 fe:02 8199010                    /home/ubuntu/merge
> aaaaaaab1000-aaaaaaab2000 rw-p 00001000 fe:02 8199010                    /home/ubuntu/merge
> aaaaaaab2000-aaaaaaad3000 rw-p 00000000 00:00 0                          [heap]
> fffff7e00000-fffff7f89000 r-xp 00000000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f89000-fffff7f98000 ---p 00189000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f98000-fffff7f9c000 r--p 00188000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f9c000-fffff7f9e000 rw-p 0018c000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f9e000-fffff7faa000 rw-p 00000000 00:00 0 
> fffff7fc2000-fffff7fed000 r-xp 00000000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> fffff7ff2000-fffff7ff3000 ---p 00000000 00:00 0 
> fffff7ff3000-fffff7ff4000 rw-p 00000000 00:00 0 
> fffff7ff5000-fffff7ff6000 rw-p 00000000 00:00 0 
> fffff7ff6000-fffff7ff7000 ---p 00000000 00:00 0 
> fffff7ff7000-fffff7ff9000 rw-p 00000000 00:00 0 
> fffff7ff9000-fffff7ffb000 r--p 00000000 00:00 0                          [vvar]
> fffff7ffb000-fffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
> fffff7ffc000-fffff7ffe000 r--p 0002a000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> fffff7ffe000-fffff8000000 rw-p 0002c000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> fffffffdf000-1000000000000 rw-p 00000000 00:00 0                         [stack]
> ubuntu@ubuntuvm:~/linux$ cat /proc/3181/maps 
> aaaaaaaa0000-aaaaaaaa1000 r-xp 00000000 fe:02 8199010                    /home/ubuntu/merge
> aaaaaaab0000-aaaaaaab1000 r--p 00000000 fe:02 8199010                    /home/ubuntu/merge
> aaaaaaab1000-aaaaaaab2000 rw-p 00001000 fe:02 8199010                    /home/ubuntu/merge
> aaaaaaab2000-aaaaaaad3000 rw-p 00000000 00:00 0                          [heap]
> fffff7e00000-fffff7f89000 r-xp 00000000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f89000-fffff7f98000 ---p 00189000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f98000-fffff7f9c000 r--p 00188000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f9c000-fffff7f9e000 rw-p 0018c000 fe:02 41410085                   /usr/lib/aarch64-linux-gnu/libc.so.6
> fffff7f9e000-fffff7faa000 rw-p 00000000 00:00 0 
> fffff7fc2000-fffff7fed000 r-xp 00000000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> fffff7ff2000-fffff7ff3000 ---p 00000000 00:00 0 
> fffff7ff3000-fffff7ff4000 rw-p 00000000 00:00 0 
> fffff7ff4000-fffff7ff5000 rw-p 00000000 00:00 0 
> fffff7ff5000-fffff7ff6000 rw-p 00000000 00:00 0 
> fffff7ff6000-fffff7ff7000 ---p 00000000 00:00 0 
> fffff7ff7000-fffff7ff9000 rw-p 00000000 00:00 0 
> fffff7ff9000-fffff7ffb000 r--p 00000000 00:00 0                          [vvar]
> fffff7ffb000-fffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
> fffff7ffc000-fffff7ffe000 r--p 0002a000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> fffff7ffe000-fffff8000000 rw-p 0002c000 fe:02 41316494                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> fffffffdf000-1000000000000 rw-p 00000000 00:00 0                         [stack]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
                     ` (2 preceding siblings ...)
  2023-08-02 15:15   ` Ryan Roberts
@ 2023-08-03  9:58   ` Ryan Roberts
  2023-08-03 10:48     ` Yin Fengwei
  3 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-03  9:58 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 28/07/2023 08:09, Yin Fengwei wrote:
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
> 
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a03bc4782a2..63de32154a48 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>  				   bool write, int *locked);
>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>  			       unsigned long bytes);
> +

Hi Yin,

I wanted to take a step back and consolidate my concerns about this function. I
should say that my understanding of the pgoff and index stuff is shaky and I
could therefore be wrong about some of this; if this is the case, then sorry for
the noise! But something about this function doesn't smell right to me, so I
figure its better to raise it...

> +/*
> + * Check whether the folio is in specific range

What exactly is the function trying to do? I *think* the intention is to figure
out if a folio is fully and contiguously mapped within a range of virtual
addresses belonging to a specific virtual address space? And I assume you want
the answer to be precise? I'm assuming 'yes' for the below.

I think the only way to do this is to actually check each PTE. And that causes a
problem, because a folio can straddle multiple PTE tables, which causes PTL
locking issues, and means having just a *pte pointer is insufficient if we need
to traverse multiple pte tables. So perhaps you need to allow for a false
negative in the case that the folio is not contained within a single pte table.

> + *
> + * First, check whether the folio is in the range of vma.
> + * Then, check whether the folio is mapped to the range of [start, end].
> + * In the end, check whether the folio is fully mapped to the range.
> + *
> + * @pte page table pointer will be checked whether the large folio
> + *      is fully mapped to. Currently, if mremap in the middle of
> + *      large folio, the large folio could be mapped to to different
> + *      VMA and address check can't identify this situation.
> + */
> +static inline bool
> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end, pte_t *pte)

The prototype looks odd to me; Fundamentally it seems to me that you need the
folio, start and end virtual addresses (the range you want to check that it is
in), the pte pointer and the virtual address that the pte represents. That
virtual address allows you to figure out the offset between the pa and va. Then
you can look at all the ptes to figure out if they reference the folio's pfns,
and use the va to pa mapping info to figure out if its within the specified range.

I don't really understand why the vma is useful.

> +{
> +	pte_t ptent;
> +	unsigned long i, nr = folio_nr_pages(folio);
> +	pgoff_t pgoff, addr;
> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
> +
> +	if (start < vma->vm_start)
> +		start = vma->vm_start;
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;
> +
> +	pgoff = folio_pgoff(folio);
> +	/* if folio start address is not in vma range */
> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
> +		return false;

Why is this pgoff calculation helpful? Surely it's only useful if the folio
belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
might be referring to a completely different file than the vma's vm_pgoff. So
you will get spurious results.

> +
> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	if (addr < start || end - addr < folio_size(folio))
> +		return false;
> +
> +	/* not necessary to check pte for none large folio */
> +	if (!folio_test_large(folio))
> +		return true;

I don't understand why you don't need to check the pte for a small folio? As
above, if the folio doesn't belong to the file that the vma is mapping the above
checks seem wrong and you can't conclude that the folio is mapped in the range
without looking at the pte?

> +
> +	if (!pte)
> +		return false;
> +
> +	/* check whether parameter pte is associated with folio */
> +	ptent = ptep_get(pte);
> +	if (pte_none(ptent) || !pte_present(ptent) ||
> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
> +		return false;
> +
> +	pte -= pte_pfn(ptent) - folio_pfn(folio);

I think this could wander off the front or back of the pte table into arbitrary
memory if the folio is strddling multiple pte tables.

> +	for (i = 0; i < nr; i++, pte++) {
> +		ptent = ptep_get(pte);
> +
> +		if (pte_none(ptent) || !pte_present(ptent) ||
> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)

Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
redundant?

Thanks,
Ryan


> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline bool
> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
> +{
> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
> +}
> +
>  /*
>   * mlock_vma_folio() and munlock_vma_folio():
>   * should be called with vma's mmap_lock held for read or write,


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-03  9:58   ` Ryan Roberts
@ 2023-08-03 10:48     ` Yin Fengwei
  2023-08-03 13:20       ` Ryan Roberts
  0 siblings, 1 reply; 26+ messages in thread
From: Yin Fengwei @ 2023-08-03 10:48 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/3/23 17:58, Ryan Roberts wrote:
> On 28/07/2023 08:09, Yin Fengwei wrote:
>> It will be used to check whether the folio is mapped to specific
>> VMA and whether the mapping address of folio is in the range.
>>
>> Also a helper function folio_within_vma() to check whether folio
>> is in the range of vma based on folio_in_range().
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 69 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 5a03bc4782a2..63de32154a48 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>  				   bool write, int *locked);
>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>  			       unsigned long bytes);
>> +
> 
> Hi Yin,
> 
> I wanted to take a step back and consolidate my concerns about this function. I
> should say that my understanding of the pgoff and index stuff is shaky and I
> could therefore be wrong about some of this; if this is the case, then sorry for
> the noise! But something about this function doesn't smell right to me, so I
> figure its better to raise it...
No worries. And thank you for looking at the code ans share the comments.
That helps me a lot. Really appreciate it.

> 
>> +/*
>> + * Check whether the folio is in specific range
> 
> What exactly is the function trying to do? I *think* the intention is to figure
> out if a folio is fully and contiguously mapped within a range of virtual
> addresses belonging to a specific virtual address space? And I assume you want
> the answer to be precise? I'm assuming 'yes' for the below.
> 
> I think the only way to do this is to actually check each PTE. And that causes a
> problem, because a folio can straddle multiple PTE tables, which causes PTL
> locking issues, and means having just a *pte pointer is insufficient if we need
> to traverse multiple pte tables. So perhaps you need to allow for a false
> negative in the case that the folio is not contained within a single pte table.
Let's check the users of this function first:
 mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA
 madvise: Needs to check whether the folio is in the range of VMA and
          range [start, end). This range is from user space. It could contain
          VMA range (in this case, we only need to check VMA range) or is contained
          by VMA range.

So we check:
  1. Whether the folio is in the range.

     To do this, we need to first check whether the folio->index is in the
     VMA range. If not, we know the folio is not in VMA range. Just return false
     without further check.

     Then, we will check whether the folio is in the range which is defined as
     [max(start, vma->vm_start), min(end, vma->vm_end)).


     This is the version in RFC as I was not aware of mremap case and forgot the
     page cache case. So if no mremap with an anonymous folio, this check is enough.
     But we need to add PTE check for mremap and page cache cases.

  2. Check PTE for mremap in middle of large folio and page cache
     If the folio is normal 4K and if the folio is in the range, it's not possible
     the folio is partially mapped to two different VMA. So we are sure this folio
     is in the range.

     Currently, check PTE didn't handle the case you mentioned. But my plan is
     checking whether the folio is mapped cross page table (by checking whether
     the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE).
     If it cross, I will treat it as out of VMA range. Otherwise, we can reuse
     current check because we know the PTE pointer is always valid.

     Obviously, the PTE checking needs hold pte lock as you pointed out.


My understanding:
pgoff is important if we have folio and VMA and want to know the virtual address of
the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get
vaddr by:
      addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk()
implementation.

> 
>> + *
>> + * First, check whether the folio is in the range of vma.
>> + * Then, check whether the folio is mapped to the range of [start, end].
>> + * In the end, check whether the folio is fully mapped to the range.
>> + *
>> + * @pte page table pointer will be checked whether the large folio
>> + *      is fully mapped to. Currently, if mremap in the middle of
>> + *      large folio, the large folio could be mapped to to different
>> + *      VMA and address check can't identify this situation.
>> + */
>> +static inline bool
>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>> +		unsigned long start, unsigned long end, pte_t *pte)
> 
> The prototype looks odd to me; Fundamentally it seems to me that you need the
> folio, start and end virtual addresses (the range you want to check that it is
> in), the pte pointer and the virtual address that the pte represents. That
> virtual address allows you to figure out the offset between the pa and va. Then
> you can look at all the ptes to figure out if they reference the folio's pfns,
> and use the va to pa mapping info to figure out if its within the specified range.
> 
> I don't really understand why the vma is useful.
> 
>> +{
>> +	pte_t ptent;
>> +	unsigned long i, nr = folio_nr_pages(folio);
>> +	pgoff_t pgoff, addr;
>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>> +
>> +	if (start < vma->vm_start)
>> +		start = vma->vm_start;
>> +	if (end > vma->vm_end)
>> +		end = vma->vm_end;
>> +
>> +	pgoff = folio_pgoff(folio);
>> +	/* if folio start address is not in vma range */
>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>> +		return false;
> 
> Why is this pgoff calculation helpful? Surely it's only useful if the folio
> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
> might be referring to a completely different file than the vma's vm_pgoff. So
> you will get spurious results.
> 
>> +
>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +	if (addr < start || end - addr < folio_size(folio))
>> +		return false;
>> +
>> +	/* not necessary to check pte for none large folio */
>> +	if (!folio_test_large(folio))
>> +		return true;
> 
> I don't understand why you don't need to check the pte for a small folio? As
> above, if the folio doesn't belong to the file that the vma is mapping the above
> checks seem wrong and you can't conclude that the folio is mapped in the range
> without looking at the pte?
> 
>> +
>> +	if (!pte)
>> +		return false;
>> +
>> +	/* check whether parameter pte is associated with folio */
>> +	ptent = ptep_get(pte);
>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>> +		return false;
>> +
>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
> 
> I think this could wander off the front or back of the pte table into arbitrary
> memory if the folio is strddling multiple pte tables.
> 
>> +	for (i = 0; i < nr; i++, pte++) {
>> +		ptent = ptep_get(pte);
>> +
>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
> 
> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
> redundant?
I think you are right. pte_none() can be removed here.


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
>> +			return false;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static inline bool
>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>> +{
>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>> +}
>> +
>>  /*
>>   * mlock_vma_folio() and munlock_vma_folio():
>>   * should be called with vma's mmap_lock held for read or write,
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-03 10:48     ` Yin Fengwei
@ 2023-08-03 13:20       ` Ryan Roberts
  2023-08-03 23:15         ` Yin, Fengwei
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-03 13:20 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 03/08/2023 11:48, Yin Fengwei wrote:
> 
> 
> On 8/3/23 17:58, Ryan Roberts wrote:
>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>> It will be used to check whether the folio is mapped to specific
>>> VMA and whether the mapping address of folio is in the range.
>>>
>>> Also a helper function folio_within_vma() to check whether folio
>>> is in the range of vma based on folio_in_range().
>>>
>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>> ---
>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 69 insertions(+)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 5a03bc4782a2..63de32154a48 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>  				   bool write, int *locked);
>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>  			       unsigned long bytes);
>>> +
>>
>> Hi Yin,
>>
>> I wanted to take a step back and consolidate my concerns about this function. I
>> should say that my understanding of the pgoff and index stuff is shaky and I
>> could therefore be wrong about some of this; if this is the case, then sorry for
>> the noise! But something about this function doesn't smell right to me, so I
>> figure its better to raise it...
> No worries. And thank you for looking at the code ans share the comments.
> That helps me a lot. Really appreciate it.
> 
>>
>>> +/*
>>> + * Check whether the folio is in specific range
>>
>> What exactly is the function trying to do? I *think* the intention is to figure
>> out if a folio is fully and contiguously mapped within a range of virtual
>> addresses belonging to a specific virtual address space? And I assume you want
>> the answer to be precise? I'm assuming 'yes' for the below.
>>
>> I think the only way to do this is to actually check each PTE. And that causes a
>> problem, because a folio can straddle multiple PTE tables, which causes PTL
>> locking issues, and means having just a *pte pointer is insufficient if we need
>> to traverse multiple pte tables. So perhaps you need to allow for a false
>> negative in the case that the folio is not contained within a single pte table.
> Let's check the users of this function first:
>  mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA
>  madvise: Needs to check whether the folio is in the range of VMA and
>           range [start, end).

Sure but those 2 ranges (the vma and the user-supplied range) are known to
intersect, right? So really there is only one range of interest; the
intersection. I would argue that should be done in a higher level wrapper, and
is not part of the core primitive to work out if a folio is mapped to a
particular range of virtual addresses.

> This range is from user space. It could contain
>           VMA range (in this case, we only need to check VMA range) or is contained
>           by VMA range.
> 
> So we check:
>   1. Whether the folio is in the range.
> 
>      To do this, we need to first check whether the folio->index is in the
>      VMA range. If not, we know the folio is not in VMA range. Just return false
>      without further check.

Right, so if the check fails, the folio is definitely not mapped by the vma, but
if it passes, it *might* be. Further checks are required. So why is this useful?
Why not just do the check that gives you the precise answer and skip this?

> 
>      Then, we will check whether the folio is in the range which is defined as
>      [max(start, vma->vm_start), min(end, vma->vm_end)).

As par above comment, I would personally factor this out of the primitive.

> 
> 
>      This is the version in RFC as I was not aware of mremap case and forgot the
>      page cache case. So if no mremap with an anonymous folio, this check is enough.
>      But we need to add PTE check for mremap and page cache cases.
> 
>   2. Check PTE for mremap in middle of large folio and page cache
>      If the folio is normal 4K and if the folio is in the range, it's not possible
>      the folio is partially mapped to two different VMA. So we are sure this folio
>      is in the range.

But you test for small folio and exit early without checking the PTE. Given the
index check only told you that the folio *might be* mapped, I don't see how you
can return true at this point for a small folio, without looking at the PTE?

folio->index is just the page offset within the file it maps (or the (original)
VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset
within the file that the vma starts at. So you could have a folio that contains
the contents of 1 file and a vma that maps another file, and the folio's index
might fall within the VMA, but it doesn't mean the folio is mapped by the vma;
its a different file.

Or perhaps there is an assumption in the function's spec that the folio is known
to have at least one page mapped in the vma? That would change things... but you
should make that very clear in the spec. And in that case, you can move the
small folio check to the start and return true immediately.

> 
>      Currently, check PTE didn't handle the case you mentioned. But my plan is
>      checking whether the folio is mapped cross page table (by checking whether
>      the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE).
>      If it cross, I will treat it as out of VMA range. Otherwise, we can reuse
>      current check because we know the PTE pointer is always valid.
> 
>      Obviously, the PTE checking needs hold pte lock as you pointed out.
> 
> 
> My understanding:
> pgoff is important if we have folio and VMA and want to know the virtual address of
> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get
> vaddr by:
>       addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk()
> implementation.

Yes but the rmap is only traversing vmas that are already known to map the same
file that the folio contains pages for. (See rmap_walk_file(): it grabs the
"mapping" from the folio, then traverses the list of vmas within the range of
interest "vma_interval_tree_foreach"). Its a bit more complicated for anon
memory but I think the princple is the same.

> 
>>
>>> + *
>>> + * First, check whether the folio is in the range of vma.
>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>> + * In the end, check whether the folio is fully mapped to the range.
>>> + *
>>> + * @pte page table pointer will be checked whether the large folio
>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>> + *      large folio, the large folio could be mapped to to different
>>> + *      VMA and address check can't identify this situation.
>>> + */
>>> +static inline bool
>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>
>> The prototype looks odd to me; Fundamentally it seems to me that you need the
>> folio, start and end virtual addresses (the range you want to check that it is
>> in), the pte pointer and the virtual address that the pte represents. That
>> virtual address allows you to figure out the offset between the pa and va. Then
>> you can look at all the ptes to figure out if they reference the folio's pfns,
>> and use the va to pa mapping info to figure out if its within the specified range.
>>
>> I don't really understand why the vma is useful.
>>
>>> +{
>>> +	pte_t ptent;
>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>> +	pgoff_t pgoff, addr;
>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>> +
>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>> +
>>> +	if (start < vma->vm_start)
>>> +		start = vma->vm_start;
>>> +	if (end > vma->vm_end)
>>> +		end = vma->vm_end;
>>> +
>>> +	pgoff = folio_pgoff(folio);
>>> +	/* if folio start address is not in vma range */
>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>> +		return false;
>>
>> Why is this pgoff calculation helpful? Surely it's only useful if the folio
>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
>> might be referring to a completely different file than the vma's vm_pgoff. So
>> you will get spurious results.
>>
>>> +
>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>> +	if (addr < start || end - addr < folio_size(folio))
>>> +		return false;
>>> +
>>> +	/* not necessary to check pte for none large folio */
>>> +	if (!folio_test_large(folio))
>>> +		return true;
>>
>> I don't understand why you don't need to check the pte for a small folio? As
>> above, if the folio doesn't belong to the file that the vma is mapping the above
>> checks seem wrong and you can't conclude that the folio is mapped in the range
>> without looking at the pte?
>>
>>> +
>>> +	if (!pte)
>>> +		return false;
>>> +
>>> +	/* check whether parameter pte is associated with folio */
>>> +	ptent = ptep_get(pte);
>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>> +		return false;
>>> +
>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>
>> I think this could wander off the front or back of the pte table into arbitrary
>> memory if the folio is strddling multiple pte tables.
>>
>>> +	for (i = 0; i < nr; i++, pte++) {
>>> +		ptent = ptep_get(pte);
>>> +
>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>
>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
>> redundant?
> I think you are right. pte_none() can be removed here.
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>> +			return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static inline bool
>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>> +{
>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>> +}
>>> +
>>>  /*
>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>   * should be called with vma's mmap_lock held for read or write,
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-03 13:20       ` Ryan Roberts
@ 2023-08-03 23:15         ` Yin, Fengwei
  2023-08-04  8:46           ` Ryan Roberts
  0 siblings, 1 reply; 26+ messages in thread
From: Yin, Fengwei @ 2023-08-03 23:15 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd



On 8/3/2023 9:20 PM, Ryan Roberts wrote:
> On 03/08/2023 11:48, Yin Fengwei wrote:
>>
>>
>> On 8/3/23 17:58, Ryan Roberts wrote:
>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>> It will be used to check whether the folio is mapped to specific
>>>> VMA and whether the mapping address of folio is in the range.
>>>>
>>>> Also a helper function folio_within_vma() to check whether folio
>>>> is in the range of vma based on folio_in_range().
>>>>
>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>> ---
>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 69 insertions(+)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 5a03bc4782a2..63de32154a48 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>  				   bool write, int *locked);
>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>  			       unsigned long bytes);
>>>> +
>>>
>>> Hi Yin,
>>>
>>> I wanted to take a step back and consolidate my concerns about this function. I
>>> should say that my understanding of the pgoff and index stuff is shaky and I
>>> could therefore be wrong about some of this; if this is the case, then sorry for
>>> the noise! But something about this function doesn't smell right to me, so I
>>> figure its better to raise it...
>> No worries. And thank you for looking at the code ans share the comments.
>> That helps me a lot. Really appreciate it.
>>
>>>
>>>> +/*
>>>> + * Check whether the folio is in specific range
>>>
>>> What exactly is the function trying to do? I *think* the intention is to figure
>>> out if a folio is fully and contiguously mapped within a range of virtual
>>> addresses belonging to a specific virtual address space? And I assume you want
>>> the answer to be precise? I'm assuming 'yes' for the below.
>>>
>>> I think the only way to do this is to actually check each PTE. And that causes a
>>> problem, because a folio can straddle multiple PTE tables, which causes PTL
>>> locking issues, and means having just a *pte pointer is insufficient if we need
>>> to traverse multiple pte tables. So perhaps you need to allow for a false
>>> negative in the case that the folio is not contained within a single pte table.
>> Let's check the users of this function first:
>>  mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA
>>  madvise: Needs to check whether the folio is in the range of VMA and
>>           range [start, end).
> 
> Sure but those 2 ranges (the vma and the user-supplied range) are known to
> intersect, right? So really there is only one range of interest; the
> intersection. I would argue that should be done in a higher level wrapper, and
> is not part of the core primitive to work out if a folio is mapped to a
> particular range of virtual addresses.
>> This range is from user space. It could contain
>>           VMA range (in this case, we only need to check VMA range) or is contained
>>           by VMA range.
>>
>> So we check:
>>   1. Whether the folio is in the range.
>>
>>      To do this, we need to first check whether the folio->index is in the
>>      VMA range. If not, we know the folio is not in VMA range. Just return false
>>      without further check.
> 
> Right, so if the check fails, the folio is definitely not mapped by the vma, but
> if it passes, it *might* be. Further checks are required. So why is this useful?
> Why not just do the check that gives you the precise answer and skip this?
What check can give precise answer? I don't think checking PTE is right at this
stage. For following case (assume the folio is mapped in same page table and just
care about the VMA2 range):

|-----------vma1-----------|------------vma2---------|
                    |-------folio-----|

How can we tell whether the folio is in the range of VMA2 by just checking PTE?
And in this case, why not bail out if we see folio is out range of VMA2?


So only need to check PTE if we are sure the folio is in the range of VMA2:
|-----------vma1-----------|------------vma2---------|
                              |-------folio-----|

> 
>>
>>      Then, we will check whether the folio is in the range which is defined as
>>      [max(start, vma->vm_start), min(end, vma->vm_end)).
> 
> As par above comment, I would personally factor this out of the primitive.
> 
>>
>>
>>      This is the version in RFC as I was not aware of mremap case and forgot the
>>      page cache case. So if no mremap with an anonymous folio, this check is enough.
>>      But we need to add PTE check for mremap and page cache cases.
>>
>>   2. Check PTE for mremap in middle of large folio and page cache
>>      If the folio is normal 4K and if the folio is in the range, it's not possible
>>      the folio is partially mapped to two different VMA. So we are sure this folio
>>      is in the range.
> 
> But you test for small folio and exit early without checking the PTE. Given the
> index check only told you that the folio *might be* mapped, I don't see how you
> can return true at this point for a small folio, without looking at the PTE?
Yes. We can do this. You can check the discussion on the RFC version. I did move
the normal 4K folio check out of this function. Yu and Huge suggested to just ignore
the normal 4K as this API can' handle it.

The thing changed from RFC is that we need to check PTE here. I tried to avoid to
check PTE even for normal 4K folio.

> 
> folio->index is just the page offset within the file it maps (or the (original)
> VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset
> within the file that the vma starts at. So you could have a folio that contains
> the contents of 1 file and a vma that maps another file, and the folio's index
> might fall within the VMA, but it doesn't mean the folio is mapped by the vma;
> its a different file.
> 
> Or perhaps there is an assumption in the function's spec that the folio is known
> to have at least one page mapped in the vma? That would change things... but you
> should make that very clear in the spec. And in that case, you can move the
> small folio check to the start and return true immediately.
Yes. "At least one page mapped in VMA" is assumption here. I will make it clear in
the comment.


Regards
Yin, Fengwei

> 
>>
>>      Currently, check PTE didn't handle the case you mentioned. But my plan is
>>      checking whether the folio is mapped cross page table (by checking whether
>>      the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE).
>>      If it cross, I will treat it as out of VMA range. Otherwise, we can reuse
>>      current check because we know the PTE pointer is always valid.
>>
>>      Obviously, the PTE checking needs hold pte lock as you pointed out.
>>
>>
>> My understanding:
>> pgoff is important if we have folio and VMA and want to know the virtual address of
>> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get
>> vaddr by:
>>       addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk()
>> implementation.
> 
> Yes but the rmap is only traversing vmas that are already known to map the same
> file that the folio contains pages for. (See rmap_walk_file(): it grabs the
> "mapping" from the folio, then traverses the list of vmas within the range of
> interest "vma_interval_tree_foreach"). Its a bit more complicated for anon
> memory but I think the princple is the same.
> 
>>
>>>
>>>> + *
>>>> + * First, check whether the folio is in the range of vma.
>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>> + *
>>>> + * @pte page table pointer will be checked whether the large folio
>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>> + *      large folio, the large folio could be mapped to to different
>>>> + *      VMA and address check can't identify this situation.
>>>> + */
>>>> +static inline bool
>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>
>>> The prototype looks odd to me; Fundamentally it seems to me that you need the
>>> folio, start and end virtual addresses (the range you want to check that it is
>>> in), the pte pointer and the virtual address that the pte represents. That
>>> virtual address allows you to figure out the offset between the pa and va. Then
>>> you can look at all the ptes to figure out if they reference the folio's pfns,
>>> and use the va to pa mapping info to figure out if its within the specified range.
>>>
>>> I don't really understand why the vma is useful.
>>>
>>>> +{
>>>> +	pte_t ptent;
>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>> +	pgoff_t pgoff, addr;
>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>> +
>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>> +
>>>> +	if (start < vma->vm_start)
>>>> +		start = vma->vm_start;
>>>> +	if (end > vma->vm_end)
>>>> +		end = vma->vm_end;
>>>> +
>>>> +	pgoff = folio_pgoff(folio);
>>>> +	/* if folio start address is not in vma range */
>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>> +		return false;
>>>
>>> Why is this pgoff calculation helpful? Surely it's only useful if the folio
>>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
>>> might be referring to a completely different file than the vma's vm_pgoff. So
>>> you will get spurious results.
>>>
>>>> +
>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>> +		return false;
>>>> +
>>>> +	/* not necessary to check pte for none large folio */
>>>> +	if (!folio_test_large(folio))
>>>> +		return true;
>>>
>>> I don't understand why you don't need to check the pte for a small folio? As
>>> above, if the folio doesn't belong to the file that the vma is mapping the above
>>> checks seem wrong and you can't conclude that the folio is mapped in the range
>>> without looking at the pte?
>>>
>>>> +
>>>> +	if (!pte)
>>>> +		return false;
>>>> +
>>>> +	/* check whether parameter pte is associated with folio */
>>>> +	ptent = ptep_get(pte);
>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>> +		return false;
>>>> +
>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>
>>> I think this could wander off the front or back of the pte table into arbitrary
>>> memory if the folio is strddling multiple pte tables.
>>>
>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>> +		ptent = ptep_get(pte);
>>>> +
>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>
>>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
>>> redundant?
>> I think you are right. pte_none() can be removed here.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> +			return false;
>>>> +	}
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static inline bool
>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>> +{
>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>> +}
>>>> +
>>>>  /*
>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>   * should be called with vma's mmap_lock held for read or write,
>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-03 23:15         ` Yin, Fengwei
@ 2023-08-04  8:46           ` Ryan Roberts
  2023-08-04  9:05             ` Yin, Fengwei
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2023-08-04  8:46 UTC (permalink / raw)
  To: Yin, Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

On 04/08/2023 00:15, Yin, Fengwei wrote:
> 
> 
> On 8/3/2023 9:20 PM, Ryan Roberts wrote:
>> On 03/08/2023 11:48, Yin Fengwei wrote:
>>>
>>>
>>> On 8/3/23 17:58, Ryan Roberts wrote:
>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>> It will be used to check whether the folio is mapped to specific
>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>
>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>> is in the range of vma based on folio_in_range().
>>>>>
>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>> ---
>>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 69 insertions(+)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>  				   bool write, int *locked);
>>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>  			       unsigned long bytes);
>>>>> +
>>>>
>>>> Hi Yin,
>>>>
>>>> I wanted to take a step back and consolidate my concerns about this function. I
>>>> should say that my understanding of the pgoff and index stuff is shaky and I
>>>> could therefore be wrong about some of this; if this is the case, then sorry for
>>>> the noise! But something about this function doesn't smell right to me, so I
>>>> figure its better to raise it...
>>> No worries. And thank you for looking at the code ans share the comments.
>>> That helps me a lot. Really appreciate it.
>>>
>>>>
>>>>> +/*
>>>>> + * Check whether the folio is in specific range
>>>>
>>>> What exactly is the function trying to do? I *think* the intention is to figure
>>>> out if a folio is fully and contiguously mapped within a range of virtual
>>>> addresses belonging to a specific virtual address space? And I assume you want
>>>> the answer to be precise? I'm assuming 'yes' for the below.
>>>>
>>>> I think the only way to do this is to actually check each PTE. And that causes a
>>>> problem, because a folio can straddle multiple PTE tables, which causes PTL
>>>> locking issues, and means having just a *pte pointer is insufficient if we need
>>>> to traverse multiple pte tables. So perhaps you need to allow for a false
>>>> negative in the case that the folio is not contained within a single pte table.
>>> Let's check the users of this function first:
>>>  mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA
>>>  madvise: Needs to check whether the folio is in the range of VMA and
>>>           range [start, end).
>>
>> Sure but those 2 ranges (the vma and the user-supplied range) are known to
>> intersect, right? So really there is only one range of interest; the
>> intersection. I would argue that should be done in a higher level wrapper, and
>> is not part of the core primitive to work out if a folio is mapped to a
>> particular range of virtual addresses.
>>> This range is from user space. It could contain
>>>           VMA range (in this case, we only need to check VMA range) or is contained
>>>           by VMA range.
>>>
>>> So we check:
>>>   1. Whether the folio is in the range.
>>>
>>>      To do this, we need to first check whether the folio->index is in the
>>>      VMA range. If not, we know the folio is not in VMA range. Just return false
>>>      without further check.
>>
>> Right, so if the check fails, the folio is definitely not mapped by the vma, but
>> if it passes, it *might* be. Further checks are required. So why is this useful?
>> Why not just do the check that gives you the precise answer and skip this?
> What check can give precise answer? I don't think checking PTE is right at this
> stage. For following case (assume the folio is mapped in same page table and just
> care about the VMA2 range):
> 
> |-----------vma1-----------|------------vma2---------|
>                     |-------folio-----|
> 
> How can we tell whether the folio is in the range of VMA2 by just checking PTE?
> And in this case, why not bail out if we see folio is out range of VMA2?
> 
> 
> So only need to check PTE if we are sure the folio is in the range of VMA2:
> |-----------vma1-----------|------------vma2---------|
>                               |-------folio-----|
> 
>>
>>>
>>>      Then, we will check whether the folio is in the range which is defined as
>>>      [max(start, vma->vm_start), min(end, vma->vm_end)).
>>
>> As par above comment, I would personally factor this out of the primitive.
>>
>>>
>>>
>>>      This is the version in RFC as I was not aware of mremap case and forgot the
>>>      page cache case. So if no mremap with an anonymous folio, this check is enough.
>>>      But we need to add PTE check for mremap and page cache cases.
>>>
>>>   2. Check PTE for mremap in middle of large folio and page cache
>>>      If the folio is normal 4K and if the folio is in the range, it's not possible
>>>      the folio is partially mapped to two different VMA. So we are sure this folio
>>>      is in the range.
>>
>> But you test for small folio and exit early without checking the PTE. Given the
>> index check only told you that the folio *might be* mapped, I don't see how you
>> can return true at this point for a small folio, without looking at the PTE?
> Yes. We can do this. You can check the discussion on the RFC version. I did move
> the normal 4K folio check out of this function. Yu and Huge suggested to just ignore
> the normal 4K as this API can' handle it.
> 
> The thing changed from RFC is that we need to check PTE here. I tried to avoid to
> check PTE even for normal 4K folio.

I'll go read the RFC. I've made my point; if you and others are convinced this
is correct, then fair enough.

> 
>>
>> folio->index is just the page offset within the file it maps (or the (original)
>> VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset
>> within the file that the vma starts at. So you could have a folio that contains
>> the contents of 1 file and a vma that maps another file, and the folio's index
>> might fall within the VMA, but it doesn't mean the folio is mapped by the vma;
>> its a different file.
>>
>> Or perhaps there is an assumption in the function's spec that the folio is known
>> to have at least one page mapped in the vma? That would change things... but you
>> should make that very clear in the spec. And in that case, you can move the
>> small folio check to the start and return true immediately.
> Yes. "At least one page mapped in VMA" is assumption here. I will make it clear in
> the comment.
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>>
>>>      Currently, check PTE didn't handle the case you mentioned. But my plan is
>>>      checking whether the folio is mapped cross page table (by checking whether
>>>      the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE).
>>>      If it cross, I will treat it as out of VMA range. Otherwise, we can reuse
>>>      current check because we know the PTE pointer is always valid.
>>>
>>>      Obviously, the PTE checking needs hold pte lock as you pointed out.
>>>
>>>
>>> My understanding:
>>> pgoff is important if we have folio and VMA and want to know the virtual address of
>>> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get
>>> vaddr by:
>>>       addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk()
>>> implementation.
>>
>> Yes but the rmap is only traversing vmas that are already known to map the same
>> file that the folio contains pages for. (See rmap_walk_file(): it grabs the
>> "mapping" from the folio, then traverses the list of vmas within the range of
>> interest "vma_interval_tree_foreach"). Its a bit more complicated for anon
>> memory but I think the princple is the same.
>>
>>>
>>>>
>>>>> + *
>>>>> + * First, check whether the folio is in the range of vma.
>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>> + *
>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>>> + *      large folio, the large folio could be mapped to to different
>>>>> + *      VMA and address check can't identify this situation.
>>>>> + */
>>>>> +static inline bool
>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>>
>>>> The prototype looks odd to me; Fundamentally it seems to me that you need the
>>>> folio, start and end virtual addresses (the range you want to check that it is
>>>> in), the pte pointer and the virtual address that the pte represents. That
>>>> virtual address allows you to figure out the offset between the pa and va. Then
>>>> you can look at all the ptes to figure out if they reference the folio's pfns,
>>>> and use the va to pa mapping info to figure out if its within the specified range.
>>>>
>>>> I don't really understand why the vma is useful.
>>>>
>>>>> +{
>>>>> +	pte_t ptent;
>>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>>> +	pgoff_t pgoff, addr;
>>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>> +
>>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>> +
>>>>> +	if (start < vma->vm_start)
>>>>> +		start = vma->vm_start;
>>>>> +	if (end > vma->vm_end)
>>>>> +		end = vma->vm_end;
>>>>> +
>>>>> +	pgoff = folio_pgoff(folio);
>>>>> +	/* if folio start address is not in vma range */
>>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>> +		return false;
>>>>
>>>> Why is this pgoff calculation helpful? Surely it's only useful if the folio
>>>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
>>>> might be referring to a completely different file than the vma's vm_pgoff. So
>>>> you will get spurious results.
>>>>
>>>>> +
>>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>>> +		return false;
>>>>> +
>>>>> +	/* not necessary to check pte for none large folio */
>>>>> +	if (!folio_test_large(folio))
>>>>> +		return true;
>>>>
>>>> I don't understand why you don't need to check the pte for a small folio? As
>>>> above, if the folio doesn't belong to the file that the vma is mapping the above
>>>> checks seem wrong and you can't conclude that the folio is mapped in the range
>>>> without looking at the pte?
>>>>
>>>>> +
>>>>> +	if (!pte)
>>>>> +		return false;
>>>>> +
>>>>> +	/* check whether parameter pte is associated with folio */
>>>>> +	ptent = ptep_get(pte);
>>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>> +		return false;
>>>>> +
>>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>
>>>> I think this could wander off the front or back of the pte table into arbitrary
>>>> memory if the folio is strddling multiple pte tables.
>>>>
>>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>>> +		ptent = ptep_get(pte);
>>>>> +
>>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>
>>>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
>>>> redundant?
>>> I think you are right. pte_none() can be removed here.
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>> +			return false;
>>>>> +	}
>>>>> +
>>>>> +	return true;
>>>>> +}
>>>>> +
>>>>> +static inline bool
>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>> +{
>>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>>   * should be called with vma's mmap_lock held for read or write,
>>>>
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-04  8:46           ` Ryan Roberts
@ 2023-08-04  9:05             ` Yin, Fengwei
  0 siblings, 0 replies; 26+ messages in thread
From: Yin, Fengwei @ 2023-08-04  9:05 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, david,
	shy828301, hughd

Hi Ryan,

On 8/4/2023 4:46 PM, Ryan Roberts wrote:
> On 04/08/2023 00:15, Yin, Fengwei wrote:
>>
>>
>> On 8/3/2023 9:20 PM, Ryan Roberts wrote:
>>> On 03/08/2023 11:48, Yin Fengwei wrote:
>>>>
>>>>
>>>> On 8/3/23 17:58, Ryan Roberts wrote:
>>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>>> It will be used to check whether the folio is mapped to specific
>>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>>
>>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>>> is in the range of vma based on folio_in_range().
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>> ---
>>>>>>  mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 69 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>>  				   bool write, int *locked);
>>>>>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>>  			       unsigned long bytes);
>>>>>> +
>>>>>
>>>>> Hi Yin,
>>>>>
>>>>> I wanted to take a step back and consolidate my concerns about this function. I
>>>>> should say that my understanding of the pgoff and index stuff is shaky and I
>>>>> could therefore be wrong about some of this; if this is the case, then sorry for
>>>>> the noise! But something about this function doesn't smell right to me, so I
>>>>> figure its better to raise it...
>>>> No worries. And thank you for looking at the code ans share the comments.
>>>> That helps me a lot. Really appreciate it.
>>>>
>>>>>
>>>>>> +/*
>>>>>> + * Check whether the folio is in specific range
>>>>>
>>>>> What exactly is the function trying to do? I *think* the intention is to figure
>>>>> out if a folio is fully and contiguously mapped within a range of virtual
>>>>> addresses belonging to a specific virtual address space? And I assume you want
>>>>> the answer to be precise? I'm assuming 'yes' for the below.
>>>>>
>>>>> I think the only way to do this is to actually check each PTE. And that causes a
>>>>> problem, because a folio can straddle multiple PTE tables, which causes PTL
>>>>> locking issues, and means having just a *pte pointer is insufficient if we need
>>>>> to traverse multiple pte tables. So perhaps you need to allow for a false
>>>>> negative in the case that the folio is not contained within a single pte table.
>>>> Let's check the users of this function first:
>>>>  mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA
>>>>  madvise: Needs to check whether the folio is in the range of VMA and
>>>>           range [start, end).
>>>
>>> Sure but those 2 ranges (the vma and the user-supplied range) are known to
>>> intersect, right? So really there is only one range of interest; the
>>> intersection. I would argue that should be done in a higher level wrapper, and
>>> is not part of the core primitive to work out if a folio is mapped to a
>>> particular range of virtual addresses.
>>>> This range is from user space. It could contain
>>>>           VMA range (in this case, we only need to check VMA range) or is contained
>>>>           by VMA range.
>>>>
>>>> So we check:
>>>>   1. Whether the folio is in the range.
>>>>
>>>>      To do this, we need to first check whether the folio->index is in the
>>>>      VMA range. If not, we know the folio is not in VMA range. Just return false
>>>>      without further check.
>>>
>>> Right, so if the check fails, the folio is definitely not mapped by the vma, but
>>> if it passes, it *might* be. Further checks are required. So why is this useful?
>>> Why not just do the check that gives you the precise answer and skip this?
>> What check can give precise answer? I don't think checking PTE is right at this
>> stage. For following case (assume the folio is mapped in same page table and just
>> care about the VMA2 range):
>>
>> |-----------vma1-----------|------------vma2---------|
>>                     |-------folio-----|
>>
>> How can we tell whether the folio is in the range of VMA2 by just checking PTE?
>> And in this case, why not bail out if we see folio is out range of VMA2?
>>
>>
>> So only need to check PTE if we are sure the folio is in the range of VMA2:
>> |-----------vma1-----------|------------vma2---------|
>>                               |-------folio-----|
>>
>>>
>>>>
>>>>      Then, we will check whether the folio is in the range which is defined as
>>>>      [max(start, vma->vm_start), min(end, vma->vm_end)).
>>>
>>> As par above comment, I would personally factor this out of the primitive.
>>>
>>>>
>>>>
>>>>      This is the version in RFC as I was not aware of mremap case and forgot the
>>>>      page cache case. So if no mremap with an anonymous folio, this check is enough.
>>>>      But we need to add PTE check for mremap and page cache cases.
>>>>
>>>>   2. Check PTE for mremap in middle of large folio and page cache
>>>>      If the folio is normal 4K and if the folio is in the range, it's not possible
>>>>      the folio is partially mapped to two different VMA. So we are sure this folio
>>>>      is in the range.
>>>
>>> But you test for small folio and exit early without checking the PTE. Given the
>>> index check only told you that the folio *might be* mapped, I don't see how you
>>> can return true at this point for a small folio, without looking at the PTE?
>> Yes. We can do this. You can check the discussion on the RFC version. I did move
>> the normal 4K folio check out of this function. Yu and Huge suggested to just ignore
>> the normal 4K as this API can' handle it.
>>
>> The thing changed from RFC is that we need to check PTE here. I tried to avoid to
>> check PTE even for normal 4K folio.
> 
> I'll go read the RFC. I've made my point; if you and others are convinced this
> is correct, then fair enough.
Thanks a lot for your comments. I will work on new version based on your and Andrew's
comments.

Sometimes, I blindly assume other people know the assumptions (like at least one page
mapped in the VMA) I made. Which did make the review harder. I will try to improve
on this in the future.

Regards
Yin, Fengwei

> 
>>
>>>
>>> folio->index is just the page offset within the file it maps (or the (original)
>>> VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset
>>> within the file that the vma starts at. So you could have a folio that contains
>>> the contents of 1 file and a vma that maps another file, and the folio's index
>>> might fall within the VMA, but it doesn't mean the folio is mapped by the vma;
>>> its a different file.
>>>
>>> Or perhaps there is an assumption in the function's spec that the folio is known
>>> to have at least one page mapped in the vma? That would change things... but you
>>> should make that very clear in the spec. And in that case, you can move the
>>> small folio check to the start and return true immediately.
>> Yes. "At least one page mapped in VMA" is assumption here. I will make it clear in
>> the comment.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>>      Currently, check PTE didn't handle the case you mentioned. But my plan is
>>>>      checking whether the folio is mapped cross page table (by checking whether
>>>>      the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE).
>>>>      If it cross, I will treat it as out of VMA range. Otherwise, we can reuse
>>>>      current check because we know the PTE pointer is always valid.
>>>>
>>>>      Obviously, the PTE checking needs hold pte lock as you pointed out.
>>>>
>>>>
>>>> My understanding:
>>>> pgoff is important if we have folio and VMA and want to know the virtual address of
>>>> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get
>>>> vaddr by:
>>>>       addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk()
>>>> implementation.
>>>
>>> Yes but the rmap is only traversing vmas that are already known to map the same
>>> file that the folio contains pages for. (See rmap_walk_file(): it grabs the
>>> "mapping" from the folio, then traverses the list of vmas within the range of
>>> interest "vma_interval_tree_foreach"). Its a bit more complicated for anon
>>> memory but I think the princple is the same.
>>>
>>>>
>>>>>
>>>>>> + *
>>>>>> + * First, check whether the folio is in the range of vma.
>>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>>> + *
>>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>>> + *      is fully mapped to. Currently, if mremap in the middle of
>>>>>> + *      large folio, the large folio could be mapped to to different
>>>>>> + *      VMA and address check can't identify this situation.
>>>>>> + */
>>>>>> +static inline bool
>>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>>> +		unsigned long start, unsigned long end, pte_t *pte)
>>>>>
>>>>> The prototype looks odd to me; Fundamentally it seems to me that you need the
>>>>> folio, start and end virtual addresses (the range you want to check that it is
>>>>> in), the pte pointer and the virtual address that the pte represents. That
>>>>> virtual address allows you to figure out the offset between the pa and va. Then
>>>>> you can look at all the ptes to figure out if they reference the folio's pfns,
>>>>> and use the va to pa mapping info to figure out if its within the specified range.
>>>>>
>>>>> I don't really understand why the vma is useful.
>>>>>
>>>>>> +{
>>>>>> +	pte_t ptent;
>>>>>> +	unsigned long i, nr = folio_nr_pages(folio);
>>>>>> +	pgoff_t pgoff, addr;
>>>>>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>>> +
>>>>>> +	if (start < vma->vm_start)
>>>>>> +		start = vma->vm_start;
>>>>>> +	if (end > vma->vm_end)
>>>>>> +		end = vma->vm_end;
>>>>>> +
>>>>>> +	pgoff = folio_pgoff(folio);
>>>>>> +	/* if folio start address is not in vma range */
>>>>>> +	if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>>> +		return false;
>>>>>
>>>>> Why is this pgoff calculation helpful? Surely it's only useful if the folio
>>>>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
>>>>> might be referring to a completely different file than the vma's vm_pgoff. So
>>>>> you will get spurious results.
>>>>>
>>>>>> +
>>>>>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>> +	if (addr < start || end - addr < folio_size(folio))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* not necessary to check pte for none large folio */
>>>>>> +	if (!folio_test_large(folio))
>>>>>> +		return true;
>>>>>
>>>>> I don't understand why you don't need to check the pte for a small folio? As
>>>>> above, if the folio doesn't belong to the file that the vma is mapping the above
>>>>> checks seem wrong and you can't conclude that the folio is mapped in the range
>>>>> without looking at the pte?
>>>>>
>>>>>> +
>>>>>> +	if (!pte)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* check whether parameter pte is associated with folio */
>>>>>> +	ptent = ptep_get(pte);
>>>>>> +	if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>> +			pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>>
>>>>> I think this could wander off the front or back of the pte table into arbitrary
>>>>> memory if the folio is strddling multiple pte tables.
>>>>>
>>>>>> +	for (i = 0; i < nr; i++, pte++) {
>>>>>> +		ptent = ptep_get(pte);
>>>>>> +
>>>>>> +		if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>> +				pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>
>>>>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
>>>>> redundant?
>>>> I think you are right. pte_none() can be removed here.
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> +			return false;
>>>>>> +	}
>>>>>> +
>>>>>> +	return true;
>>>>>> +}
>>>>>> +
>>>>>> +static inline bool
>>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>>> +{
>>>>>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>>> +}
>>>>>> +
>>>>>>  /*
>>>>>>   * mlock_vma_folio() and munlock_vma_folio():
>>>>>>   * should be called with vma's mmap_lock held for read or write,
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-08-04  9:09 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-28  7:09 [PATCH 0/3] support large folio for mlock Yin Fengwei
2023-07-28  7:09 ` [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
2023-07-28 18:34   ` Andrew Morton
2023-07-29 13:54     ` Yin, Fengwei
2023-08-02 11:14   ` Ryan Roberts
2023-08-02 11:35     ` Ryan Roberts
2023-08-02 13:12       ` Yin, Fengwei
2023-08-02 15:12         ` Ryan Roberts
2023-08-03  1:36           ` Yin Fengwei
2023-08-02 12:50     ` Yin, Fengwei
2023-08-02 13:09       ` Ryan Roberts
2023-08-02 13:46         ` Yin, Fengwei
2023-08-02 14:08           ` Ryan Roberts
2023-08-02 14:14             ` Yin, Fengwei
2023-08-02 14:59               ` Ryan Roberts
2023-08-03  0:24                 ` Yin Fengwei
2023-08-02 15:15   ` Ryan Roberts
2023-08-03  0:41     ` Yin Fengwei
2023-08-03  9:58   ` Ryan Roberts
2023-08-03 10:48     ` Yin Fengwei
2023-08-03 13:20       ` Ryan Roberts
2023-08-03 23:15         ` Yin, Fengwei
2023-08-04  8:46           ` Ryan Roberts
2023-08-04  9:05             ` Yin, Fengwei
2023-07-28  7:09 ` [PATCH 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range Yin Fengwei
2023-07-28  7:09 ` [PATCH 3/3] mm: mlock: update mlock_pte_range to handle large folio Yin Fengwei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).