* [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry
@ 2023-06-02 23:05 Peter Xu
  2023-06-02 23:05 ` [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable() Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Peter Xu @ 2023-06-02 23:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: David Hildenbrand, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, peterx, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

When pmd_trans_unstable() is hit under the mmap read lock, it means we
raced with something else.  Per the comment above the helper, we could
treat it as a none pmd, but the 100% correct way is to always retry, and
I don't think it should race again in most cases.

Not taking care of that retry can mean different things on different
paths.

For example, for smaps it means inaccurate accounting when we skip those
raced regions, but that's fine anyway because the accounting was never
meant to be 100% accurate.

OTOH I think it's broken for pagemap, because the pagemap output buffer
is linear to the VA range we're scanning: if we skip some region, the
follow-up scans can fill in the wrong slots, I think.  It means the
pagemap results returned to userspace can be wrong when we're very
unlucky.
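
To illustrate (a contrived C sketch of the failure mode; the names here
are illustrative, not the real fs/proc/task_mmu.c code):

#include <stdint.h>

#define PAGE_SIZE	4096UL

struct pagemapread {
	uint64_t *buffer;	/* one 64-bit entry per scanned page */
	int pos;		/* next slot to fill */
};

uint64_t lookup_entry(unsigned long addr);	/* hypothetical */

static void scan_range(struct pagemapread *pm, unsigned long addr,
		       unsigned long end)
{
	for (; addr < end; addr += PAGE_SIZE)
		pm->buffer[pm->pos++] = lookup_entry(addr);
}

If one pmd's range is silently skipped without advancing pm->pos, every
entry for the ranges after it lands PMD_SIZE / PAGE_SIZE (512 on x86)
slots too early, so userspace reads the wrong data for the wrong
addresses.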

This reminded me that I should have a look at all the call sites of
pmd_trans_unstable().  Some of them are alright, but I do see many that
would be better off giving it another shot when the race is hit.

This series tries to resolve all such call sites by making them retry.

I really don't know whether I missed something, and even if not, whether
it matters much to anyone.  Still, _if_ I'm correct it may be worth
considering a fix.  Happy to be proven wrong.  Then Muhammad should know
how to code his series.

The patchset is only smoke tested; I see nothing wrong so far.

Please have a look, thanks.

Peter Xu (4):
  mm/mprotect: Retry on pmd_trans_unstable()
  mm/migrate: Unify and retry an unstable pmd when hit
  mm: Warn for unstable pmd in move_page_tables()
  mm: Make most walk page paths with pmd_trans_unstable() to retry

 fs/proc/task_mmu.c  | 17 +++++++++++++----
 mm/madvise.c        |  8 ++++++--
 mm/memcontrol.c     |  8 ++++++--
 mm/memory-failure.c |  4 +++-
 mm/mempolicy.c      |  4 +++-
 mm/migrate_device.c |  9 ++++-----
 mm/mprotect.c       | 20 +++++++++++---------
 mm/mremap.c         |  4 ++--
 8 files changed, 48 insertions(+), 26 deletions(-)

-- 
2.40.1



* [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable()
  2023-06-02 23:05 [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
@ 2023-06-02 23:05 ` Peter Xu
  2023-06-03  2:04   ` Yang Shi
  2023-06-02 23:05 ` [PATCH 2/4] mm/migrate: Unify and retry an unstable pmd when hit Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-06-02 23:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: David Hildenbrand, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, peterx, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

When an unstable pmd is hit, we should retry the pmd once more, because
it means we probably raced with a thp insertion.

Skipping it can be a problem, as no error will be reported to the caller.
I assume the user will then expect the protection change (e.g. mprotect
or userfaultfd wr-protect) to have been applied when it actually was not.

To achieve this, move the pmd_trans_unstable() check out of
change_pte_range(), which makes the retry easier and keeps the return
value of change_pte_range() untouched.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..e4756899d40c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -94,15 +94,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
-	/*
-	 * Can be called with only the mmap_lock for reading by
-	 * prot_numa so we must check the pmd isn't constantly
-	 * changing from under us from pmd_none to pmd_trans_huge
-	 * and/or the other way around.
-	 */
-	if (pmd_trans_unstable(pmd))
-		return 0;
-
 	/*
 	 * The pmd points to a regular pte so the pmd can't change
 	 * from under us even if the mmap_lock is only hold for
@@ -411,6 +402,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			pages = ret;
 			break;
 		}
+again:
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -465,6 +457,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
+
+		/*
+		 * Can be called with only the mmap_lock for reading by
+		 * prot_numa or userfaultfd-wp, so we must check the pmd
+		 * isn't constantly changing from under us from pmd_none to
+		 * pmd_trans_huge and/or the other way around.
+		 */
+		if (pmd_trans_unstable(pmd))
+			goto again;
+
 		pages += change_pte_range(tlb, vma, pmd, addr, next,
 					  newprot, cp_flags);
 next:
-- 
2.40.1



* [PATCH 2/4] mm/migrate: Unify and retry an unstable pmd when hit
  2023-06-02 23:05 [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
  2023-06-02 23:05 ` [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable() Peter Xu
@ 2023-06-02 23:05 ` Peter Xu
  2023-06-02 23:05 ` [PATCH 3/4] mm: Warn for unstable pmd in move_page_tables() Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-06-02 23:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: David Hildenbrand, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, peterx, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

There's one pmd_bad() check, but it would be better to use
pmd_clear_bad(), which is part of pmd_trans_unstable().

And I assume that's not enough, because a thp insertion can race with us
when reaching pmd_bad(): the pmd can be !bad but a thp, and then the pte
walk is illegal.

There's one case though where the function already used
pmd_trans_unstable(), but only after a pmd split.  Merge the two checks
into one, and when an unstable pmd is hit, retry the whole pmd.

Cc: Alistair Popple <apopple@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/migrate_device.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..6fc54c053c05 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmdp, addr);
-			if (pmd_trans_unstable(pmdp))
-				return migrate_vma_collect_skip(start, end,
-								walk);
 		} else {
 			int ret;
 
@@ -106,8 +103,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 	}
 
-	if (unlikely(pmd_bad(*pmdp)))
-		return migrate_vma_collect_skip(start, end, walk);
+	if (unlikely(pmd_trans_unstable(pmdp))) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
-- 
2.40.1



* [PATCH 3/4] mm: Warn for unstable pmd in move_page_tables()
  2023-06-02 23:05 [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
  2023-06-02 23:05 ` [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable() Peter Xu
  2023-06-02 23:05 ` [PATCH 2/4] mm/migrate: Unify and retry an unstable pmd when hit Peter Xu
@ 2023-06-02 23:05 ` Peter Xu
  2023-06-02 23:05 ` [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry Peter Xu
  2023-06-07 13:49 ` [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
  4 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-06-02 23:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: David Hildenbrand, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, peterx, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

We already hold the mmap write lock here, so it's not possible to trigger
an unstable pmd.  Make it a WARN_ON_ONCE() instead.

Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index da107f2c71bf..9303e4da4e7f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -544,8 +544,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 					   old_pmd, new_pmd, need_rmap_locks))
 				continue;
 			split_huge_pmd(vma, old_pmd, old_addr);
-			if (pmd_trans_unstable(old_pmd))
-				continue;
+			/* We're with the mmap write lock, not possible to happen */
+			WARN_ON_ONCE(pmd_trans_unstable(old_pmd));
 		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
 			   extent == PMD_SIZE) {
 			/*
-- 
2.40.1



* [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry
  2023-06-02 23:05 [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
                   ` (2 preceding siblings ...)
  2023-06-02 23:05 ` [PATCH 3/4] mm: Warn for unstable pmd in move_page_tables() Peter Xu
@ 2023-06-02 23:05 ` Peter Xu
  2023-06-05 18:46   ` Yang Shi
  2023-06-07 13:49 ` [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-06-02 23:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: David Hildenbrand, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, peterx, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

For most of the page walk paths, it's logically always good to retry the
pmd when the pmd_trans_unstable() race is hit.  We could treat it as a
none pmd (per the comment above pmd_trans_unstable()), but in most cases
we're not even treating it as that.  If we're going to fix it anyway, a
retry is the most accurate fix.

I've gone over all the pmd_trans_unstable() special cases, and this patch
should cover all the remaining places where we should properly retry on
an unstable pmd.  With the ACTION_AGAIN mechanism introduced in 2020 we
can easily achieve that.
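
For reference, this is the mechanism that makes the retry work,
paraphrased and simplified from walk_pmd_range() in mm/pagewalk.c:

	do {
again:
		next = pmd_addr_end(addr, end);
		/* ... pmd_none() hole handling ... */

		walk->action = ACTION_SUBTREE;
		if (ops->pmd_entry)
			err = ops->pmd_entry(pmd, addr, next, walk);
		if (err)
			break;

		/* The callback asked us to re-evaluate this same pmd. */
		if (walk->action == ACTION_AGAIN)
			goto again;

		/* ... otherwise descend to the pte level or move on ... */
	} while (pmd++, addr = next, addr != end);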

These are the call sites that I think should be fixed with it:

*** fs/proc/task_mmu.c:
smaps_pte_range[634]           if (pmd_trans_unstable(pmd))
clear_refs_pte_range[1194]     if (pmd_trans_unstable(pmd))
pagemap_pmd_range[1542]        if (pmd_trans_unstable(pmdp))
gather_pte_stats[1891]         if (pmd_trans_unstable(pmd))
*** mm/memcontrol.c:
mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
*** mm/memory-failure.c:
hwpoison_pte_range[794]        if (pmd_trans_unstable(pmdp))
*** mm/mempolicy.c:
queue_folios_pte_range[517]    if (pmd_trans_unstable(pmd))
*** mm/madvise.c:
madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
madvise_free_pte_range[625]    if (pmd_trans_unstable(pmd))

IIUC most of them may not be a big issue even without a retry, because
they're already not strict (smaps, pte_stats, MADV_COLD, .. it can mean
e.g. that the statistics are inaccurate, or one less 2M chunk made cold
in the worst case), but some of them can have a functional error without
the retry afaiu (e.g. pagemap, where the output buffer can get shifted
over the unstable pmd range.. so IIUC the pagemap result can be wrong).

Meanwhile, these call sites all look fine and don't need any change:

*** include/linux/pgtable.h:
pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
*** mm/gup.c:
follow_pmd_mask[695]           if (pmd_trans_unstable(pmd))
*** mm/mapping_dirty_helpers.c:
wp_clean_pmd_entry[131]        if (!pmd_trans_unstable(&pmdval))
*** mm/memory.c:
do_anonymous_page[4060]        if (unlikely(pmd_trans_unstable(vmf->pmd)))
*** mm/migrate_device.c:
migrate_vma_insert_page[616]   if (unlikely(pmd_trans_unstable(pmdp)))
*** mm/mincore.c:
mincore_pte_range[116]         if (pmd_trans_unstable(pmd)) {

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c  | 17 +++++++++++++----
 mm/madvise.c        |  8 ++++++--
 mm/memcontrol.c     |  8 ++++++--
 mm/memory-failure.c |  4 +++-
 mm/mempolicy.c      |  4 +++-
 5 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6259dd432eeb..823eaba5c6bf 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		goto out;
+	}
+
 	/*
 	 * The mmap_lock held all the way back in m_start() is what
 	 * keeps khugepaged out of here and from collapsing things
@@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		return err;
 	}
 
-	if (pmd_trans_unstable(pmdp))
+	if (pmd_trans_unstable(pmdp)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 	/*
@@ -1888,8 +1895,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 #endif
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
diff --git a/mm/madvise.c b/mm/madvise.c
index 78cd12581628..0fd81712022c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -424,8 +424,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 regular_folio:
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -626,8 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ee433be4c3b..15e50f033e41 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6021,8 +6021,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
@@ -6241,8 +6243,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 retry:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 004a02f44271..c97fb2b7ab4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -791,8 +791,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmdp))
+	if (pmd_trans_unstable(pmdp)) {
+		walk->action = ACTION_AGAIN;
 		goto out;
+	}
 
 	mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
 						addr, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f06ca8c18e62..af8907b4aad1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,8 +514,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 	if (ptl)
 		return queue_folios_pmd(pmd, ptl, addr, end, walk);
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 
 	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
-- 
2.40.1



* Re: [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable()
  2023-06-02 23:05 ` [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable() Peter Xu
@ 2023-06-03  2:04   ` Yang Shi
  2023-06-04 23:58     ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2023-06-03  2:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, David Hildenbrand, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>
> When an unstable pmd is hit, we should retry the pmd once more, because
> it means we probably raced with a thp insertion.
>
> Skipping it can be a problem, as no error will be reported to the caller.
> I assume the user will then expect the protection change (e.g. mprotect
> or userfaultfd wr-protect) to have been applied when it actually was not.

IIRC, mprotect() holds the write mmap_lock, so it should not matter.
PROT_NUMA holds the read mmap_lock, but returning 0 also doesn't matter
(of course a retry is fine too); it just skips that 2M area. The
userfaultfd-wp case is your call :-)

>
> To achieve this, move the pmd_trans_unstable() check out of
> change_pte_range(), which makes the retry easier and keeps the return
> value of change_pte_range() untouched.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/mprotect.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 92d3d3ca390a..e4756899d40c 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -94,15 +94,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>         tlb_change_page_size(tlb, PAGE_SIZE);
>
> -       /*
> -        * Can be called with only the mmap_lock for reading by
> -        * prot_numa so we must check the pmd isn't constantly
> -        * changing from under us from pmd_none to pmd_trans_huge
> -        * and/or the other way around.
> -        */
> -       if (pmd_trans_unstable(pmd))
> -               return 0;
> -
>         /*
>          * The pmd points to a regular pte so the pmd can't change
>          * from under us even if the mmap_lock is only hold for
> @@ -411,6 +402,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>                         pages = ret;
>                         break;
>                 }
> +again:
>                 /*
>                  * Automatic NUMA balancing walks the tables with mmap_lock
>                  * held for read. It's possible a parallel update to occur
> @@ -465,6 +457,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>                         }
>                         /* fall through, the trans huge pmd just split */
>                 }
> +
> +               /*
> +                * Can be called with only the mmap_lock for reading by
> +                * prot_numa or userfaultfd-wp, so we must check the pmd
> +                * isn't constantly changing from under us from pmd_none to
> +                * pmd_trans_huge and/or the other way around.
> +                */
> +               if (pmd_trans_unstable(pmd))
> +                       goto again;
> +
>                 pages += change_pte_range(tlb, vma, pmd, addr, next,
>                                           newprot, cp_flags);
>  next:
> --
> 2.40.1
>
>


* Re: [PATCH 1/4] mm/mprotect: Retry on pmd_trans_unstable()
  2023-06-03  2:04   ` Yang Shi
@ 2023-06-04 23:58     ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-06-04 23:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-kernel, linux-mm, David Hildenbrand, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Fri, Jun 02, 2023 at 07:04:48PM -0700, Yang Shi wrote:
> On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > When an unstable pmd is hit, we should retry the pmd once more, because
> > it means we probably raced with a thp insertion.
> >
> > Skipping it can be a problem, as no error will be reported to the caller.
> > I assume the user will then expect the protection change (e.g. mprotect
> > or userfaultfd wr-protect) to have been applied when it actually was not.
> 
> IIRC, mprotect() holds the write mmap_lock, so it should not matter.
> PROT_NUMA holds the read mmap_lock, but returning 0 also doesn't matter
> (of course a retry is fine too); it just skips that 2M area.

True.

> The userfaultfd-wp case is your call :-)

Yeah, I think uffd should still be a problem.  I'll reword the commit
message (dropping the mprotect example) in the new version.

If you have time, feel free to have a look at patch 4, where I think
there's a bug for pagemap too (I didn't check it as closely as all the
rest; the memcg one might be suspicious as well, and that's also in
patch 4).

Thanks!

-- 
Peter Xu



* Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry
  2023-06-02 23:05 ` [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry Peter Xu
@ 2023-06-05 18:46   ` Yang Shi
  2023-06-05 19:20     ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2023-06-05 18:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, David Hildenbrand, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>
> For most of the page walk paths, it's logically always good to retry the
> pmd when the pmd_trans_unstable() race is hit.  We could treat it as a
> none pmd (per the comment above pmd_trans_unstable()), but in most cases
> we're not even treating it as that.  If we're going to fix it anyway, a
> retry is the most accurate fix.
>
> I've gone over all the pmd_trans_unstable() special cases, and this patch
> should cover all the remaining places where we should properly retry on
> an unstable pmd.  With the ACTION_AGAIN mechanism introduced in 2020 we
> can easily achieve that.
>
> These are the call sites that I think should be fixed with it:
>
> *** fs/proc/task_mmu.c:
> smaps_pte_range[634]           if (pmd_trans_unstable(pmd))
> clear_refs_pte_range[1194]     if (pmd_trans_unstable(pmd))
> pagemap_pmd_range[1542]        if (pmd_trans_unstable(pmdp))
> gather_pte_stats[1891]         if (pmd_trans_unstable(pmd))
> *** mm/memcontrol.c:
> mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
> mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
> *** mm/memory-failure.c:
> hwpoison_pte_range[794]        if (pmd_trans_unstable(pmdp))
> *** mm/mempolicy.c:
> queue_folios_pte_range[517]    if (pmd_trans_unstable(pmd))
> *** mm/madvise.c:
> madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
> madvise_free_pte_range[625]    if (pmd_trans_unstable(pmd))
>
> IIUC most of them may not be a big issue even without a retry, because
> they're already not strict (smaps, pte_stats, MADV_COLD, .. it can mean
> e.g. that the statistics are inaccurate, or one less 2M chunk made cold
> in the worst case), but some of them can have a functional error without
> the retry afaiu (e.g. pagemap, where the output buffer can get shifted
> over the unstable pmd range.. so IIUC the pagemap result can be wrong).
>
> Meanwhile, these call sites all look fine and don't need any change:
>
> *** include/linux/pgtable.h:
> pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> *** mm/gup.c:
> follow_pmd_mask[695]           if (pmd_trans_unstable(pmd))
> *** mm/mapping_dirty_helpers.c:
> wp_clean_pmd_entry[131]        if (!pmd_trans_unstable(&pmdval))
> *** mm/memory.c:
> do_anonymous_page[4060]        if (unlikely(pmd_trans_unstable(vmf->pmd)))
> *** mm/migrate_device.c:
> migrate_vma_insert_page[616]   if (unlikely(pmd_trans_unstable(pmdp)))
> *** mm/mincore.c:
> mincore_pte_range[116]         if (pmd_trans_unstable(pmd)) {
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/proc/task_mmu.c  | 17 +++++++++++++----
>  mm/madvise.c        |  8 ++++++--
>  mm/memcontrol.c     |  8 ++++++--
>  mm/memory-failure.c |  4 +++-
>  mm/mempolicy.c      |  4 +++-
>  5 files changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6259dd432eeb..823eaba5c6bf 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>                 goto out;
>         }
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 goto out;
> +       }
> +
>         /*
>          * The mmap_lock held all the way back in m_start() is what
>          * keeps khugepaged out of here and from collapsing things
> @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>                 return 0;
>         }
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }
>
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; pte++, addr += PAGE_SIZE) {
> @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>                 return err;
>         }
>
> -       if (pmd_trans_unstable(pmdp))
> +       if (pmd_trans_unstable(pmdp)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;

Had a quick look at the pagemap code; I agree with your analysis that
"returning 0" may mess up pagemap, and a retry should be fine. But I'm
wondering whether we should just fill in empty entries instead. Anyway,
I don't have a strong opinion on this, just a little concerned about a
potential indefinite retry.
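
Something like this, sketched against pagemap_pmd_range() (make_pme()
and add_to_pagemap() are the existing helpers in fs/proc/task_mmu.c;
the zero frame/flags are illustrative):

	if (pmd_trans_unstable(pmdp)) {
		/* Report the raced range as not-present so the output
		 * buffer stays aligned with the VA range being scanned. */
		for (; addr != end; addr += PAGE_SIZE) {
			pagemap_entry_t pme = make_pme(0, 0);

			err = add_to_pagemap(addr, &pme, pm);
			if (err)
				return err;
		}
		return 0;
	}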

> +       }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>         /*
> @@ -1888,8 +1895,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>                 return 0;
>         }
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }
>  #endif
>         orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>         do {
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 78cd12581628..0fd81712022c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -424,8 +424,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>         }
>
>  regular_folio:
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }
>  #endif
>         tlb_change_page_size(tlb, PAGE_SIZE);
>         orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -626,8 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                 if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
>                         goto next;
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }
>
>         tlb_change_page_size(tlb, PAGE_SIZE);
>         orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ee433be4c3b..15e50f033e41 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6021,8 +6021,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>                 return 0;
>         }
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;

Either retrying or keeping it as is is fine with me. I'm not aware of
anyone complaining about noticeably inaccurate charges due to this. But
we may hit the potential indefinite retry anyway, so if you don't want
to risk that, it may be better to just keep it as is.

> +       }
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; pte++, addr += PAGE_SIZE)
>                 if (get_mctgt_type(vma, addr, *pte, NULL))
> @@ -6241,8 +6243,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>                 return 0;
>         }
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }
>  retry:
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; addr += PAGE_SIZE) {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 004a02f44271..c97fb2b7ab4a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -791,8 +791,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
>                 goto out;
>         }
>
> -       if (pmd_trans_unstable(pmdp))
> +       if (pmd_trans_unstable(pmdp)) {
> +               walk->action = ACTION_AGAIN;
>                 goto out;
> +       }

This may be worth retrying IMHO.

>
>         mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
>                                                 addr, &ptl);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f06ca8c18e62..af8907b4aad1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -514,8 +514,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>         if (ptl)
>                 return queue_folios_pmd(pmd, ptl, addr, end, walk);
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }

Either retrying or keeping it as is is fine.

>
>         mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>         for (; addr != end; pte++, addr += PAGE_SIZE) {
> --
> 2.40.1
>
>


* Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry
  2023-06-05 18:46   ` Yang Shi
@ 2023-06-05 19:20     ` Peter Xu
  2023-06-06 19:12       ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-06-05 19:20 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-kernel, linux-mm, David Hildenbrand, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > For most of the page walk paths, it's logically always good to retry the
> > pmd when the pmd_trans_unstable() race is hit.  We could treat it as a
> > none pmd (per the comment above pmd_trans_unstable()), but in most cases
> > we're not even treating it as that.  If we're going to fix it anyway, a
> > retry is the most accurate fix.
> >
> > I've gone over all the pmd_trans_unstable() special cases, and this patch
> > should cover all the remaining places where we should properly retry on
> > an unstable pmd.  With the ACTION_AGAIN mechanism introduced in 2020 we
> > can easily achieve that.
> >
> > These are the call sites that I think should be fixed with it:
> >
> > *** fs/proc/task_mmu.c:
> > smaps_pte_range[634]           if (pmd_trans_unstable(pmd))
> > clear_refs_pte_range[1194]     if (pmd_trans_unstable(pmd))
> > pagemap_pmd_range[1542]        if (pmd_trans_unstable(pmdp))
> > gather_pte_stats[1891]         if (pmd_trans_unstable(pmd))
> > *** mm/memcontrol.c:
> > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
> > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
> > *** mm/memory-failure.c:
> > hwpoison_pte_range[794]        if (pmd_trans_unstable(pmdp))
> > *** mm/mempolicy.c:
> > queue_folios_pte_range[517]    if (pmd_trans_unstable(pmd))
> > *** mm/madvise.c:
> > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
> > madvise_free_pte_range[625]    if (pmd_trans_unstable(pmd))
> >
> > IIUC most of them may not be a big issue even without a retry, because
> > they're already not strict (smaps, pte_stats, MADV_COLD, .. it can mean
> > e.g. that the statistics are inaccurate, or one less 2M chunk made cold
> > in the worst case), but some of them can have a functional error without
> > the retry afaiu (e.g. pagemap, where the output buffer can get shifted
> > over the unstable pmd range.. so IIUC the pagemap result can be wrong).
> >
> > Meanwhile, these call sites all look fine and don't need any change:
> >
> > *** include/linux/pgtable.h:
> > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> > *** mm/gup.c:
> > follow_pmd_mask[695]           if (pmd_trans_unstable(pmd))
> > *** mm/mapping_dirty_helpers.c:
> > wp_clean_pmd_entry[131]        if (!pmd_trans_unstable(&pmdval))
> > *** mm/memory.c:
> > do_anonymous_page[4060]        if (unlikely(pmd_trans_unstable(vmf->pmd)))
> > *** mm/migrate_device.c:
> > migrate_vma_insert_page[616]   if (unlikely(pmd_trans_unstable(pmdp)))
> > *** mm/mincore.c:
> > mincore_pte_range[116]         if (pmd_trans_unstable(pmd)) {
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  fs/proc/task_mmu.c  | 17 +++++++++++++----
> >  mm/madvise.c        |  8 ++++++--
> >  mm/memcontrol.c     |  8 ++++++--
> >  mm/memory-failure.c |  4 +++-
> >  mm/mempolicy.c      |  4 +++-
> >  5 files changed, 31 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 6259dd432eeb..823eaba5c6bf 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >                 goto out;
> >         }
> >
> > -       if (pmd_trans_unstable(pmd))
> > +       if (pmd_trans_unstable(pmd)) {
> > +               walk->action = ACTION_AGAIN;
> >                 goto out;
> > +       }
> > +
> >         /*
> >          * The mmap_lock held all the way back in m_start() is what
> >          * keeps khugepaged out of here and from collapsing things
> > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> >                 return 0;
> >         }
> >
> > -       if (pmd_trans_unstable(pmd))
> > +       if (pmd_trans_unstable(pmd)) {
> > +               walk->action = ACTION_AGAIN;
> >                 return 0;
> > +       }
> >
> >         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >         for (; addr != end; pte++, addr += PAGE_SIZE) {
> > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> >                 return err;
> >         }
> >
> > -       if (pmd_trans_unstable(pmdp))
> > +       if (pmd_trans_unstable(pmdp)) {
> > +               walk->action = ACTION_AGAIN;
> >                 return 0;
> 
> Had a quick look at the pagemap code; I agree with your analysis that
> "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> wondering whether we should just fill in empty entries instead. Anyway,
> I don't have a strong opinion on this, just a little concerned about a
> potential indefinite retry.

Yes, a none pte is still an option.  But if we're going to fix this
anyway, it seems better to fix it with the accurate new mechanism that
popped up, and it's even less change (just set walk->action rather than
doing different things at each call site).

I see that you're worried about a dead loop here, so I hope we can
discuss it all together here.

Unlike normal checks, a pmd_trans_unstable() hit means something must
have changed within a very short period beforehand, and it should never
trigger if nothing changed concurrently under us; so it's not an "if
(flag==true)" check, which would be far more likely to loop.
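
For context, this is roughly what the helper does (paraphrased and
simplified from include/linux/pgtable.h; the real version also handles
thp migration entries):

static inline int pmd_trans_unstable(pmd_t *pmd)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	pmd_t pmdval = pmdp_get_lockless(pmd);

	/* Raced with collapse/split/zap: not a stable pte table. */
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
		return 1;
	if (unlikely(pmd_bad(pmdval))) {
		pmd_clear_bad(pmd);
		return 1;
	}
	return 0;
#else
	return 0;
#endif
}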

If we look at the places I didn't touch, most of them already suggest a
retry in one form or another.  So even with that worry, this wouldn't be
the first place to retry (and for such an "unstable" API, retrying until
it's stable is really the most natural thing to do).

So in general, it seems to me that if we dead-loop on
pmd_trans_unstable() for whatever reason, then something more seriously
wrong must have happened already..

Thanks,

-- 
Peter Xu



* Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry
  2023-06-05 19:20     ` Peter Xu
@ 2023-06-06 19:12       ` Yang Shi
  2023-06-06 19:59         ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2023-06-06 19:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, David Hildenbrand, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Mon, Jun 5, 2023 at 12:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> > On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > For most of the page walk paths, it's logically always good to retry the
> > > pmd when the pmd_trans_unstable() race is hit.  We could treat it as a
> > > none pmd (per the comment above pmd_trans_unstable()), but in most cases
> > > we're not even treating it as that.  If we're going to fix it anyway, a
> > > retry is the most accurate fix.
> > >
> > > I've gone over all the pmd_trans_unstable() special cases, and this patch
> > > should cover all the remaining places where we should properly retry on
> > > an unstable pmd.  With the ACTION_AGAIN mechanism introduced in 2020 we
> > > can easily achieve that.
> > >
> > > These are the call sites that I think should be fixed with it:
> > >
> > > *** fs/proc/task_mmu.c:
> > > smaps_pte_range[634]           if (pmd_trans_unstable(pmd))
> > > clear_refs_pte_range[1194]     if (pmd_trans_unstable(pmd))
> > > pagemap_pmd_range[1542]        if (pmd_trans_unstable(pmdp))
> > > gather_pte_stats[1891]         if (pmd_trans_unstable(pmd))
> > > *** mm/memcontrol.c:
> > > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
> > > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
> > > *** mm/memory-failure.c:
> > > hwpoison_pte_range[794]        if (pmd_trans_unstable(pmdp))
> > > *** mm/mempolicy.c:
> > > queue_folios_pte_range[517]    if (pmd_trans_unstable(pmd))
> > > *** mm/madvise.c:
> > > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
> > > madvise_free_pte_range[625]    if (pmd_trans_unstable(pmd))
> > >
> > > IIUC most of them may not be a big issue even without a retry, because
> > > they're already not strict (smaps, pte_stats, MADV_COLD, .. it can mean
> > > e.g. that the statistics are inaccurate, or one less 2M chunk made cold
> > > in the worst case), but some of them can have a functional error without
> > > the retry afaiu (e.g. pagemap, where the output buffer can get shifted
> > > over the unstable pmd range.. so IIUC the pagemap result can be wrong).
> > >
> > > Meanwhile, these call sites all look fine and don't need any change:
> > >
> > > *** include/linux/pgtable.h:
> > > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> > > *** mm/gup.c:
> > > follow_pmd_mask[695]           if (pmd_trans_unstable(pmd))
> > > *** mm/mapping_dirty_helpers.c:
> > > wp_clean_pmd_entry[131]        if (!pmd_trans_unstable(&pmdval))
> > > *** mm/memory.c:
> > > do_anonymous_page[4060]        if (unlikely(pmd_trans_unstable(vmf->pmd)))
> > > *** mm/migrate_device.c:
> > > migrate_vma_insert_page[616]   if (unlikely(pmd_trans_unstable(pmdp)))
> > > *** mm/mincore.c:
> > > mincore_pte_range[116]         if (pmd_trans_unstable(pmd)) {
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  fs/proc/task_mmu.c  | 17 +++++++++++++----
> > >  mm/madvise.c        |  8 ++++++--
> > >  mm/memcontrol.c     |  8 ++++++--
> > >  mm/memory-failure.c |  4 +++-
> > >  mm/mempolicy.c      |  4 +++-
> > >  5 files changed, 31 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 6259dd432eeb..823eaba5c6bf 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > >                 goto out;
> > >         }
> > >
> > > -       if (pmd_trans_unstable(pmd))
> > > +       if (pmd_trans_unstable(pmd)) {
> > > +               walk->action = ACTION_AGAIN;
> > >                 goto out;
> > > +       }
> > > +
> > >         /*
> > >          * The mmap_lock held all the way back in m_start() is what
> > >          * keeps khugepaged out of here and from collapsing things
> > > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> > >                 return 0;
> > >         }
> > >
> > > -       if (pmd_trans_unstable(pmd))
> > > +       if (pmd_trans_unstable(pmd)) {
> > > +               walk->action = ACTION_AGAIN;
> > >                 return 0;
> > > +       }
> > >
> > >         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > >         for (; addr != end; pte++, addr += PAGE_SIZE) {
> > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > >                 return err;
> > >         }
> > >
> > > -       if (pmd_trans_unstable(pmdp))
> > > +       if (pmd_trans_unstable(pmdp)) {
> > > +               walk->action = ACTION_AGAIN;
> > >                 return 0;
> >
> > Had a quick look at the pagemap code; I agree with your analysis that
> > "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> > wondering whether we should just fill in empty entries instead. Anyway,
> > I don't have a strong opinion on this, just a little concerned about a
> > potential indefinite retry.
>
> Yes, a none pte is still an option.  But if we're going to fix this
> anyway, it seems better to fix it with the accurate new mechanism that
> popped up, and it's even less change (just set walk->action rather than
> doing different things at each call site).
>
> I see that you're worried about a dead loop here, so I hope we can
> discuss it all together here.
>
> Unlike normal checks, a pmd_trans_unstable() hit means something must
> have changed within a very short period beforehand, and it should never
> trigger if nothing changed concurrently under us; so it's not an "if
> (flag==true)" check, which would be far more likely to loop.
>
> If we look at the places I didn't touch, most of them already suggest a
> retry in one form or another.  So even with that worry, this wouldn't be
> the first place to retry (and for such an "unstable" API, retrying until
> it's stable is really the most natural thing to do).

IIUC, other than do_anonymous_page(), which suggests a retry (retrying
the page fault), the others may not, for example:
  - follow_pmd_mask: returns -EBUSY
  - wp_clean_pmd_entry: actually just retries for the pmd_none case, but
the pagewalk code handles pmd_none by skipping it, so it basically
retries just once
  - mincore_pte_range: treated as an unmapped range by calling
__mincore_unmapped_range

Anyway, I really don't have a strong opinion on this. I may just be
over-concerned. I just thought: if nobody cares whether the result is
accurate or not, why bother fixing those cases?

>
> So in general, it seems to me that if we dead-loop on
> pmd_trans_unstable() for whatever reason, then something more seriously
> wrong must have happened already..
>
> Thanks,
>
> --
> Peter Xu
>


* Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry
  2023-06-06 19:12       ` Yang Shi
@ 2023-06-06 19:59         ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-06-06 19:59 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-kernel, linux-mm, David Hildenbrand, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Tue, Jun 06, 2023 at 12:12:03PM -0700, Yang Shi wrote:
> > > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > > >                 return err;
> > > >         }
> > > >
> > > > -       if (pmd_trans_unstable(pmdp))
> > > > +       if (pmd_trans_unstable(pmdp)) {
> > > > +               walk->action = ACTION_AGAIN;
> > > >                 return 0;
> > >
> > > Had a quick look at the pagemap code; I agree with your analysis that
> > > "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> > > wondering whether we should just fill in empty entries instead. Anyway,
> > > I don't have a strong opinion on this, just a little concerned about a
> > > potential indefinite retry.
> >
> > Yes, a none pte is still an option.  But if we're going to fix this
> > anyway, it seems better to fix it with the accurate new mechanism that
> > popped up, and it's even less change (just set walk->action rather than
> > doing different things at each call site).
> >
> > I see that you're worried about a dead loop here, so I hope we can
> > discuss it all together here.
> >
> > Unlike normal checks, a pmd_trans_unstable() hit means something must
> > have changed within a very short period beforehand, and it should never
> > trigger if nothing changed concurrently under us; so it's not an "if
> > (flag==true)" check, which would be far more likely to loop.
> >
> > If we look at the places I didn't touch, most of them already suggest a
> > retry in one form or another.  So even with that worry, this wouldn't be
> > the first place to retry (and for such an "unstable" API, retrying until
> > it's stable is really the most natural thing to do).
> 
> IIUC, other than do_anonymous_page(), which suggests a retry (retrying
> the page fault), the others may not, for example:
>   - follow_pmd_mask: returns -EBUSY

I assumed -EBUSY means that if the caller still needs the page, it'll
just have to retry.

It's actually a very rare errno to return from follow page... e.g. some
gup callers may not even be able to handle -EBUSY afaiu, and neither
does the gup core (__get_user_pages); afaiu it's just forwarded upwards.

My bet is that it's just so rare, and only possible with FOLL_SPLIT_PMD
for now.  I had a quick look: at least kvm handles -EBUSY as a real
fault (hva_to_pfn, where it translates that -EBUSY into
KVM_PFN_ERR_FAULT); I think it'll crash the hypervisor directly if gup
ever returns it one day (not for now, never without FOLL_SPLIT_PMD)..

>   - wp_clean_pmd_entry: actually just retries for the pmd_none case, but
> the pagewalk code handles pmd_none by skipping it, so it basically
> retries just once

This one is very special IMHO: pmd_trans_unstable() should in most cases
be used after someone has checked the pmd value, before walking ptes.  I
have a feeling it's kind of abused here, to check whether it's a huge
pmd in whatever format..  IMHO it should just use the other pmd_*()
helpers, but I won't touch it in this series.

>   - mincore_pte_range: treated as an unmapped range by calling
> __mincore_unmapped_range

Correct.

Also pmd_devmap_trans_unstable(), called in 3 call sites:

pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
    filemap_map_pmd[3423]          if (pmd_devmap_trans_unstable(vmf->pmd)) {
    finish_fault[4390]             if (pmd_devmap_trans_unstable(vmf->pmd))
    handle_pte_fault[4922]         if (pmd_devmap_trans_unstable(vmf->pmd))

They all suggest retrying the page fault, afaiu.

So indeed not all of them retry; but for the ones that don't, I doubt
whether that's the best we can do.  Again, I'll leave that out of this
series.

> 
> Anyway, I really don't have a strong opinion on this. I may just be
> over-concerned. I just thought: if nobody cares whether the result is
> accurate or not, why bother fixing those cases?

Because we're already at it anyway, and it's easier and cleaner to do so? :)

I would say that if we fix this, the most important ones are the pagemap
and memcg paths so far; but since I'm at it and checked all of them
anyway, I figured I should just make everywhere do the right thing, in
the same pattern, when handling an unstable pmd.  Especially since
everything this patch touches uses the walk_page*() routines (I left the
special ones to the first patches), it's quite straightforward IMHO to
switch them all to ACTION_AGAIN.

Thanks,

-- 
Peter Xu



* Re: [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry
  2023-06-02 23:05 [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
                   ` (3 preceding siblings ...)
  2023-06-02 23:05 ` [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry Peter Xu
@ 2023-06-07 13:49 ` Peter Xu
  2023-06-07 15:45   ` David Hildenbrand
  4 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-06-07 13:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: David Hildenbrand, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

On Fri, Jun 02, 2023 at 07:05:48PM -0400, Peter Xu wrote:
> Please have a look, thanks.

Hello, all,

This one seems to conflict more or less with Hugh's rework of pmd
collapse.  Please hold off on review or merging until I prepare another
version (probably based on Hugh's, after I have a closer read).

Sorry for the noise.

-- 
Peter Xu



* Re: [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry
  2023-06-07 13:49 ` [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry Peter Xu
@ 2023-06-07 15:45   ` David Hildenbrand
  2023-06-07 16:21     ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2023-06-07 15:45 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Alistair Popple, Andrew Morton, Andrea Arcangeli,
	Kirill A . Shutemov, Johannes Weiner, John Hubbard,
	Naoya Horiguchi, Muhammad Usama Anjum, Hugh Dickins,
	Mike Rapoport

On 07.06.23 15:49, Peter Xu wrote:
> On Fri, Jun 02, 2023 at 07:05:48PM -0400, Peter Xu wrote:
>> Please have a look, thanks.
> 
> Hello, all,
> 
> This one seems to conflict more or less with Hugh's rework of pmd
> collapse.  Please hold off on review or merging until I prepare another
> version (probably based on Hugh's, after I have a closer read).
> 
> Sorry for the noise.
> 

[did not have time to look yet]

Are there any fixes buried in there that we'd like to get in earlier? I
skimmed over the patches and they all read like "cleanup" +
"consistency", correct?

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry
  2023-06-07 15:45   ` David Hildenbrand
@ 2023-06-07 16:21     ` Peter Xu
  2023-06-07 16:39       ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-06-07 16:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Alistair Popple, Andrew Morton,
	Andrea Arcangeli, Kirill A . Shutemov, Johannes Weiner,
	John Hubbard, Naoya Horiguchi, Muhammad Usama Anjum,
	Hugh Dickins, Mike Rapoport

On Wed, Jun 07, 2023 at 05:45:28PM +0200, David Hildenbrand wrote:
> On 07.06.23 15:49, Peter Xu wrote:
> > On Fri, Jun 02, 2023 at 07:05:48PM -0400, Peter Xu wrote:
> > > Please have a look, thanks.
> > 
> > Hello, all,
> > 
> > This one seems to conflict more or less with Hugh's rework of pmd
> > collapse.  Please hold off on review or merging until I prepare another
> > version (probably based on Hugh's, after I have a closer read).
> > 
> > Sorry for the noise.
> > 
> 
> [did not have time to look yet]
> 
> Are there any fixes buried in there that we'd like to get in earlier? I
> skimmed over the patches and they all read like "cleanup" +
> "consistency", correct?

There are bug fixes for when we unluckily hit an unstable pmd, I think;
these ones are worth mentioning:

  - pagemap can be broken, causing reads to be shifted over to the next
    slots (wrong data read)

  - memcg accounting can go wrong: e.g., when moving a task from memcg1
    to memcg2, we can skip an unstable pmd while it could quickly come
    to contain something that belongs to memcg1, I think.  This one
    needs some eyes from memcg developers.

I'm not rushing to get them in, because these are all theoretical: I saw
no bug report, wrote no reproducer, and only spotted them by eye.

At least the pagemap issue must have been there for 10+ years without
being noticed, even if rightfully spotted this time.  Meanwhile this
seems to conflict with Hugh's series, which was posted earlier - I still
need to check how that will affect this series, but haven't yet.

That said, let me know if any of you hit any (potential) issue with the
above, or think that we should move this in earlier.

Thanks,

-- 
Peter Xu



* Re: [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry
  2023-06-07 16:21     ` Peter Xu
@ 2023-06-07 16:39       ` Yang Shi
  2023-06-07 18:22         ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2023-06-07 16:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Wed, Jun 7, 2023 at 9:21 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 07, 2023 at 05:45:28PM +0200, David Hildenbrand wrote:
> > On 07.06.23 15:49, Peter Xu wrote:
> > > On Fri, Jun 02, 2023 at 07:05:48PM -0400, Peter Xu wrote:
> > > > Please have a look, thanks.
> > >
> > > Hello, all,
> > >
> > > This one seems to conflict more or less with Hugh's rework of pmd
> > > collapse.  Please hold off on review or merging until I prepare another
> > > version (probably based on Hugh's, after I have a closer read).
> > >
> > > Sorry for the noise.
> > >
> >
> > [did not have time to look yet]
> >
> > Are there any fixes buried in there that we'd like to get in earlier? I
> > skimmed over the patches and they all read like "cleanup" +
> > "consistency", correct?
>
> There are bug fixes for when we unluckily hit an unstable pmd, I think;
> these ones are worth mentioning:
>
>   - pagemap can be broken, causing reads to be shifted over to the next
>     slots (wrong data read)

Yes, it may corrupt the pagemap data. But anyway, it seems like nobody
has been bitten by this one, as you said.

>
>   - memcg accounting can go wrong: e.g., when moving a task from memcg1
>     to memcg2, we can skip an unstable pmd while it could quickly come
>     to contain something that belongs to memcg1, I think.  This one
>     needs some eyes from memcg developers.

I don't think this one is important. There are plenty of other
conditions that can make the accounting inaccurate, for example, failing
to isolate a page from the LRU, force charging, etc. And it seems like
nobody was bothered by this either.

>
> I'm not rushing to get them in, because these are all theoretical: I saw
> no bug report, wrote no reproducer, and only spotted them by eye.
>
> At least the pagemap issue must have been there for 10+ years without
> being noticed, even if rightfully spotted this time.  Meanwhile this
> seems to conflict with Hugh's series, which was posted earlier - I still
> need to check how that will affect this series, but haven't yet.
>
> That said, let me know if any of you hit any (potential) issue with the
> above, or think that we should move this in earlier.
>
> Thanks,
>
> --
> Peter Xu
>
>


* Re: [PATCH 0/4] mm: Fix pmd_trans_unstable() call sites on retry
  2023-06-07 16:39       ` Yang Shi
@ 2023-06-07 18:22         ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-06-07 18:22 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, linux-kernel, linux-mm, Alistair Popple,
	Andrew Morton, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, John Hubbard, Naoya Horiguchi,
	Muhammad Usama Anjum, Hugh Dickins, Mike Rapoport

On Wed, Jun 07, 2023 at 09:39:44AM -0700, Yang Shi wrote:
> I don't think this one is important. There are plenty of other
> conditions that can make the accounting inaccurate, for example, failing
> to isolate a page from the LRU, force charging, etc. And it seems like
> nobody was bothered by this either.

Yes, I read that a bit more and I agree.  So let me summarize, having
just read Hugh's series..

With the prerequisite of the new __pte_offset_map() that Hugh proposed
here:

[PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail
https://lore.kernel.org/r/8218ffdc-8be-54e5-0a8-83f5542af283@google.com

We should not need pmd_trans_unstable() anymore, as Hugh pointed out,
and I fully agree.  I think Hugh's series covers all the issues that
this series wanted to address, namely:

  Patch 1 (mprotect) is covered in:

  [PATCH 18/31] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
  https://lore.kernel.org/r/4a834932-9064-9ed7-3cd1-99466f549486@google.com

  There's no way to move the spinlock outside, so one -EAGAIN is needed,
  which makes sense.

  Patch 2 (migrate_device) is covered in:

  [PATCH 24/31] mm/migrate_device: allow pte_offset_map_lock() to fail
  https://lore.kernel.org/r/ea51bb69-189c-229b-fc0-9d3e7be5d6b@google.com

  By a direct retry, with more code unified, so even better.

  Patch 3 (mremap) is covered in:

  [PATCH 19/31] mm/mremap: retry if either pte_offset_map_*lock() fails
  https://lore.kernel.org/r/2d3fbfea-5884-8211-0cc-954afe25ae9c@google.com

  Instead of a WARN_ON_ONCE(), the check got dropped entirely, which
  looks all good too.

  Most of patch 4's changes are done in:

  [PATCH 09/31] mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
  https://lore.kernel.org/r/6265ac58-6018-a8c6-cf38-69cba698471@google.com

  There's some different handling of the memcg changes, which in Hugh's
  series is put separately here:

  [PATCH 17/31] mm/various: give up if pte_offset_map[_lock]() fails
  https://lore.kernel.org/r/c299eba-4e17-c645-1115-ccd1fd9956bd@google.com

  After double-checking, I agree with Yang's comment and Hugh's approach
  that no retry is needed for memcg.
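
  For reference, the pagewalk pattern after Hugh's series looks roughly
  like the below (sketched from the linked patches; this is a sketch,
  not any single real call site):

	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	if (!pte) {
		/* pte_offset_map_lock() can now fail on an unstable pmd,
		 * so the up-front pmd_trans_unstable() check goes away
		 * and the walker retries here instead. */
		walk->action = ACTION_AGAIN;
		return 0;
	}
	/* ... walk the ptes ... */
	pte_unmap_unlock(pte, ptl);
	return 0;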

Let's ignore this series; it's not needed anymore.

Thanks,

-- 
Peter Xu


