* [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
@ 2022-02-06 21:27 Hugh Dickins
  2022-02-06 21:30 ` [PATCH 01/13] mm/munlock: delete page_mlock() and all its works Hugh Dickins
                   ` (15 more replies)
  0 siblings, 16 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

I wondered whether to post this munlocking rework in the thread at
https://lore.kernel.org/linux-mm/35c340a6-96f-28a0-2b7b-2f9fbddc01f@google.com/

There the discussion was OOM reaping, but the main reason for the rework
has been catastrophic contention on i_mmap_rwsem when exiting from
multiply mlocked files; and frustration with how contorted munlocking is.

Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
likely to conflict (I hope not fundamentally) with several concurrent
large patchsets.

tl;dr
 mm/mlock.c  |  634 +++++++++++++++-----------------------
 23 files changed, 510 insertions(+), 779 deletions(-)

In my own opinion, this is ready to go in: whatever its defects, it's
a big practical improvement over what's presently there.  Others may
differ; and it may be too much to form a quick opinion on.
My hope is that it will strike a chord with some who have laboured in
this area in the past: I'll be short of time to do much more with it,
so maybe someone else could take over if necessary.

At present there's no update to Documentation/vm/unevictable-lru.rst:
that always needs a different mindset and can follow later; please
refer to the commit messages for now.

There are two half-related mm/thp patches at the end: enhancements
we've had for a long time, but better suited to go in after the mlock
changes.

01/13 mm/munlock: delete page_mlock() and all its works
02/13 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE gupflags
03/13 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
04/13 mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
05/13 mm/munlock: replace clear_page_mlock() by final clearance
06/13 mm/munlock: maintain page->mlock_count while unevictable
07/13 mm/munlock: mlock_pte_range() when mlocking or munlocking
08/13 mm/migrate: __unmap_and_move() push good newpage to LRU
09/13 mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
10/13 mm/munlock: mlock_page() munlock_page() batch by pagevec
11/13 mm/munlock: page migration needs mlock pagevec drained
12/13 mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
13/13 mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

 include/linux/mm.h        |    2 
 include/linux/mm_inline.h |   11 
 include/linux/mm_types.h  |   19 +
 include/linux/rmap.h      |   23 -
 kernel/events/uprobes.c   |    7 
 mm/gup.c                  |   43 --
 mm/huge_memory.c          |   55 ---
 mm/hugetlb.c              |    4 
 mm/internal.h             |   64 ++-
 mm/khugepaged.c           |   14 
 mm/ksm.c                  |   12 
 mm/madvise.c              |    5 
 mm/memcontrol.c           |    3 
 mm/memory.c               |   45 --
 mm/migrate.c              |   42 +-
 mm/mlock.c                |  634 +++++++++++++++-----------------------
 mm/mmap.c                 |   32 -
 mm/mmzone.c               |    7 
 mm/oom_kill.c             |    2 
 mm/rmap.c                 |  156 ++-------
 mm/swap.c                 |   89 ++---
 mm/userfaultfd.c          |   14 
 mm/vmscan.c               |    6 
 23 files changed, 510 insertions(+), 779 deletions(-)

Hugh


* [PATCH 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
@ 2022-02-06 21:30 ` Hugh Dickins
  2022-02-09 18:31   ` Vlastimil Babka
  2022-02-06 21:32 ` [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE Hugh Dickins
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

We have recommended that some applications mlock their userspace, but
that turns out to be counter-productive: when many processes mlock the
same file, contention on rmap's i_mmap_rwsem can become intolerable at
exit: it is needed for write, to remove any vma mapping that file from
rmap's tree; but hogged for read by those with mlocks calling
page_mlock() (formerly known as try_to_munlock()) on *each* page
mapped from the file (the purpose being to find out whether another
process has the page mlocked, in which case it should not be munlocked
yet).

Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.

A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant.  Let's try to reuse its page->lru.
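
The idea, very roughly - a sketch only, since the real change is left
to a later patch in this series, and its layout and field names may
differ - is to overlay an mlock_count on the lru slot of struct page
while the page is unevictable:

	struct page {
		unsigned long flags;
		union {
			struct list_head lru;
			/* Or, while on the unevictable "LRU", reuse its space */
			struct {
				void *__filler;	/* keep even: must not look like PageTail */
				unsigned int mlock_count; /* vmas holding this page mlocked */
			};
		};
		/* ... rest of struct page unchanged ... */
	};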

But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder.  Don't mind those old comments about THPs: they
date from before 4.5's refcounting rework, and splitting is not a risk
here.

Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/rmap.h |   6 -
 mm/internal.h        |   2 +-
 mm/mlock.c           | 377 +++----------------------------------------
 mm/rmap.c            |  80 ---------
 4 files changed, 25 insertions(+), 440 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e704b1a4c06c..dc48aa8c2c94 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
  */
 int folio_mkclean(struct folio *);
 
-/*
- * called in munlock()/munmap() path to check for other vmas holding
- * the page mlocked.
- */
-void page_mlock(struct page *page);
-
 void remove_migration_ptes(struct page *old, struct page *new, bool locked);
 
 /*
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..e48c486d5ddf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -409,7 +409,7 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
  * must be called with vma's mmap_lock held for read or write, and page locked.
  */
 extern void mlock_vma_page(struct page *page);
-extern unsigned int munlock_vma_page(struct page *page);
+extern void munlock_vma_page(struct page *page);
 
 extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
 			      unsigned long len);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8f584eddd305..544c18ce2c58 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -46,12 +46,6 @@ EXPORT_SYMBOL(can_do_mlock);
  * be placed on the LRU "unevictable" list, rather than the [in]active lists.
  * The unevictable list is an LRU sibling list to the [in]active lists.
  * PageUnevictable is set to indicate the unevictable state.
- *
- * When lazy mlocking via vmscan, it is important to ensure that the
- * vma's VM_LOCKED status is not concurrently being modified, otherwise we
- * may have mlocked a page that is being munlocked. So lazy mlock must take
- * the mmap_lock for read, and verify that the vma really is locked
- * (see mm/rmap.c).
  */
 
 /*
@@ -106,299 +100,28 @@ void mlock_vma_page(struct page *page)
 	}
 }
 
-/*
- * Finish munlock after successful page isolation
- *
- * Page must be locked. This is a wrapper for page_mlock()
- * and putback_lru_page() with munlock accounting.
- */
-static void __munlock_isolated_page(struct page *page)
-{
-	/*
-	 * Optimization: if the page was mapped just once, that's our mapping
-	 * and we don't need to check all the other vmas.
-	 */
-	if (page_mapcount(page) > 1)
-		page_mlock(page);
-
-	/* Did try_to_unlock() succeed or punt? */
-	if (!PageMlocked(page))
-		count_vm_events(UNEVICTABLE_PGMUNLOCKED, thp_nr_pages(page));
-
-	putback_lru_page(page);
-}
-
-/*
- * Accounting for page isolation fail during munlock
- *
- * Performs accounting when page isolation fails in munlock. There is nothing
- * else to do because it means some other task has already removed the page
- * from the LRU. putback_lru_page() will take care of removing the page from
- * the unevictable list, if necessary. vmscan [page_referenced()] will move
- * the page back to the unevictable list if some other vma has it mlocked.
- */
-static void __munlock_isolation_failed(struct page *page)
-{
-	int nr_pages = thp_nr_pages(page);
-
-	if (PageUnevictable(page))
-		__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
-	else
-		__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
-}
-
 /**
  * munlock_vma_page - munlock a vma page
  * @page: page to be unlocked, either a normal page or THP page head
- *
- * returns the size of the page as a page mask (0 for normal page,
- *         HPAGE_PMD_NR - 1 for THP head page)
- *
- * called from munlock()/munmap() path with page supposedly on the LRU.
- * When we munlock a page, because the vma where we found the page is being
- * munlock()ed or munmap()ed, we want to check whether other vmas hold the
- * page locked so that we can leave it on the unevictable lru list and not
- * bother vmscan with it.  However, to walk the page's rmap list in
- * page_mlock() we must isolate the page from the LRU.  If some other
- * task has removed the page from the LRU, we won't be able to do that.
- * So we clear the PageMlocked as we might not get another chance.  If we
- * can't isolate the page, we leave it for putback_lru_page() and vmscan
- * [page_referenced()/try_to_unmap()] to deal with.
  */
-unsigned int munlock_vma_page(struct page *page)
+void munlock_vma_page(struct page *page)
 {
-	int nr_pages;
-
-	/* For page_mlock() and to serialize with page migration */
+	/* Serialize with page migration */
 	BUG_ON(!PageLocked(page));
-	VM_BUG_ON_PAGE(PageTail(page), page);
-
-	if (!TestClearPageMlocked(page)) {
-		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		return 0;
-	}
-
-	nr_pages = thp_nr_pages(page);
-	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-
-	if (!isolate_lru_page(page))
-		__munlock_isolated_page(page);
-	else
-		__munlock_isolation_failed(page);
-
-	return nr_pages - 1;
-}
-
-/*
- * convert get_user_pages() return value to posix mlock() error
- */
-static int __mlock_posix_error_return(long retval)
-{
-	if (retval == -EFAULT)
-		retval = -ENOMEM;
-	else if (retval == -ENOMEM)
-		retval = -EAGAIN;
-	return retval;
-}
-
-/*
- * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec()
- *
- * The fast path is available only for evictable pages with single mapping.
- * Then we can bypass the per-cpu pvec and get better performance.
- * when mapcount > 1 we need page_mlock() which can fail.
- * when !page_evictable(), we need the full redo logic of putback_lru_page to
- * avoid leaving evictable page in unevictable list.
- *
- * In case of success, @page is added to @pvec and @pgrescued is incremented
- * in case that the page was previously unevictable. @page is also unlocked.
- */
-static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec,
-		int *pgrescued)
-{
-	VM_BUG_ON_PAGE(PageLRU(page), page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	if (page_mapcount(page) <= 1 && page_evictable(page)) {
-		pagevec_add(pvec, page);
-		if (TestClearPageUnevictable(page))
-			(*pgrescued)++;
-		unlock_page(page);
-		return true;
-	}
-
-	return false;
-}
-
-/*
- * Putback multiple evictable pages to the LRU
- *
- * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of
- * the pages might have meanwhile become unevictable but that is OK.
- */
-static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
-{
-	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
-	/*
-	 *__pagevec_lru_add() calls release_pages() so we don't call
-	 * put_page() explicitly
-	 */
-	__pagevec_lru_add(pvec);
-	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
-}
 
-/*
- * Munlock a batch of pages from the same zone
- *
- * The work is split to two main phases. First phase clears the Mlocked flag
- * and attempts to isolate the pages, all under a single zone lru lock.
- * The second phase finishes the munlock only for pages where isolation
- * succeeded.
- *
- * Note that the pagevec may be modified during the process.
- */
-static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
-{
-	int i;
-	int nr = pagevec_count(pvec);
-	int delta_munlocked = -nr;
-	struct pagevec pvec_putback;
-	struct lruvec *lruvec = NULL;
-	int pgrescued = 0;
-
-	pagevec_init(&pvec_putback);
-
-	/* Phase 1: page isolation */
-	for (i = 0; i < nr; i++) {
-		struct page *page = pvec->pages[i];
-		struct folio *folio = page_folio(page);
-
-		if (TestClearPageMlocked(page)) {
-			/*
-			 * We already have pin from follow_page_mask()
-			 * so we can spare the get_page() here.
-			 */
-			if (TestClearPageLRU(page)) {
-				lruvec = folio_lruvec_relock_irq(folio, lruvec);
-				del_page_from_lru_list(page, lruvec);
-				continue;
-			} else
-				__munlock_isolation_failed(page);
-		} else {
-			delta_munlocked++;
-		}
+	VM_BUG_ON_PAGE(PageTail(page), page);
 
-		/*
-		 * We won't be munlocking this page in the next phase
-		 * but we still need to release the follow_page_mask()
-		 * pin. We cannot do it under lru_lock however. If it's
-		 * the last pin, __page_cache_release() would deadlock.
-		 */
-		pagevec_add(&pvec_putback, pvec->pages[i]);
-		pvec->pages[i] = NULL;
-	}
-	if (lruvec) {
-		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-		unlock_page_lruvec_irq(lruvec);
-	} else if (delta_munlocked) {
-		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	}
+	if (TestClearPageMlocked(page)) {
+		int nr_pages = thp_nr_pages(page);
 
-	/* Now we can release pins of pages that we are not munlocking */
-	pagevec_release(&pvec_putback);
-
-	/* Phase 2: page munlock */
-	for (i = 0; i < nr; i++) {
-		struct page *page = pvec->pages[i];
-
-		if (page) {
-			lock_page(page);
-			if (!__putback_lru_fast_prepare(page, &pvec_putback,
-					&pgrescued)) {
-				/*
-				 * Slow path. We don't want to lose the last
-				 * pin before unlock_page()
-				 */
-				get_page(page); /* for putback_lru_page() */
-				__munlock_isolated_page(page);
-				unlock_page(page);
-				put_page(page); /* from follow_page_mask() */
-			}
+		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+		if (!isolate_lru_page(page)) {
+			putback_lru_page(page);
+			count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
+		} else if (PageUnevictable(page)) {
+			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 		}
 	}
-
-	/*
-	 * Phase 3: page putback for pages that qualified for the fast path
-	 * This will also call put_page() to return pin from follow_page_mask()
-	 */
-	if (pagevec_count(&pvec_putback))
-		__putback_lru_fast(&pvec_putback, pgrescued);
-}
-
-/*
- * Fill up pagevec for __munlock_pagevec using pte walk
- *
- * The function expects that the struct page corresponding to @start address is
- * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
- *
- * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
- * pages also get pinned.
- *
- * Returns the address of the next page that should be scanned. This equals
- * @start + PAGE_SIZE when no page could be added by the pte walk.
- */
-static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
-			struct vm_area_struct *vma, struct zone *zone,
-			unsigned long start, unsigned long end)
-{
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	/*
-	 * Initialize pte walk starting at the already pinned page where we
-	 * are sure that there is a pte, as it was pinned under the same
-	 * mmap_lock write op.
-	 */
-	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
-	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
-
-	/* The page next to the pinned page is the first we will try to get */
-	start += PAGE_SIZE;
-	while (start < end) {
-		struct page *page = NULL;
-		pte++;
-		if (pte_present(*pte))
-			page = vm_normal_page(vma, start, *pte);
-		/*
-		 * Break if page could not be obtained or the page's node+zone does not
-		 * match
-		 */
-		if (!page || page_zone(page) != zone)
-			break;
-
-		/*
-		 * Do not use pagevec for PTE-mapped THP,
-		 * munlock_vma_pages_range() will handle them.
-		 */
-		if (PageTransCompound(page))
-			break;
-
-		get_page(page);
-		/*
-		 * Increase the address that will be returned *before* the
-		 * eventual break due to pvec becoming full by adding the page
-		 */
-		start += PAGE_SIZE;
-		if (pagevec_add(pvec, page) == 0)
-			break;
-	}
-	pte_unmap_unlock(pte, ptl);
-	return start;
 }
 
 /*
@@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
  *
  * Returns with VM_LOCKED cleared.  Callers must be prepared to
  * deal with this.
- *
- * We don't save and restore VM_LOCKED here because pages are
- * still on lru.  In unmap path, pages might be scanned by reclaim
- * and re-mlocked by page_mlock/try_to_unmap before we unmap and
- * free them.  This will result in freeing mlocked pages.
  */
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-
-	while (start < end) {
-		struct page *page;
-		unsigned int page_mask = 0;
-		unsigned long page_increm;
-		struct pagevec pvec;
-		struct zone *zone;
-
-		pagevec_init(&pvec);
-		/*
-		 * Although FOLL_DUMP is intended for get_dump_page(),
-		 * it just so happens that its special treatment of the
-		 * ZERO_PAGE (returning an error instead of doing get_page)
-		 * suits munlock very well (and if somehow an abnormal page
-		 * has sneaked into the range, we won't oops here: great).
-		 */
-		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
-
-		if (page && !IS_ERR(page)) {
-			if (PageTransTail(page)) {
-				VM_BUG_ON_PAGE(PageMlocked(page), page);
-				put_page(page); /* follow_page_mask() */
-			} else if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to compute
-				 * the page_mask here instead.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-
-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zone, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
-		}
-		page_increm = 1 + page_mask;
-		start += page_increm * PAGE_SIZE;
-next:
-		cond_resched();
-	}
+	/* Reimplementation to follow in later commit */
 }
 
 /*
@@ -645,6 +304,18 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
 	return count >> PAGE_SHIFT;
 }
 
+/*
+ * convert get_user_pages() return value to posix mlock() error
+ */
+static int __mlock_posix_error_return(long retval)
+{
+	if (retval == -EFAULT)
+		retval = -ENOMEM;
+	else if (retval == -ENOMEM)
+		retval = -EAGAIN;
+	return retval;
+}
+
 static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
 {
 	unsigned long locked;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..7ce7f1946cff 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1996,76 +1996,6 @@ void try_to_migrate(struct page *page, enum ttu_flags flags)
 		rmap_walk(page, &rwc);
 }
 
-/*
- * Walks the vma's mapping a page and mlocks the page if any locked vma's are
- * found. Once one is found the page is locked and the scan can be terminated.
- */
-static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address, void *unused)
-{
-	struct page_vma_mapped_walk pvmw = {
-		.page = page,
-		.vma = vma,
-		.address = address,
-	};
-
-	/* An un-locked vma doesn't have any pages to lock, continue the scan */
-	if (!(vma->vm_flags & VM_LOCKED))
-		return true;
-
-	while (page_vma_mapped_walk(&pvmw)) {
-		/*
-		 * Need to recheck under the ptl to serialise with
-		 * __munlock_pagevec_fill() after VM_LOCKED is cleared in
-		 * munlock_vma_pages_range().
-		 */
-		if (vma->vm_flags & VM_LOCKED) {
-			/*
-			 * PTE-mapped THP are never marked as mlocked; but
-			 * this function is never called on a DoubleMap THP,
-			 * nor on an Anon THP (which may still be PTE-mapped
-			 * after DoubleMap was cleared).
-			 */
-			mlock_vma_page(page);
-			/*
-			 * No need to scan further once the page is marked
-			 * as mlocked.
-			 */
-			page_vma_mapped_walk_done(&pvmw);
-			return false;
-		}
-	}
-
-	return true;
-}
-
-/**
- * page_mlock - try to mlock a page
- * @page: the page to be mlocked
- *
- * Called from munlock code. Checks all of the VMAs mapping the page and mlocks
- * the page if any are found. The page will be returned with PG_mlocked cleared
- * if it is not mapped by any locked vmas.
- */
-void page_mlock(struct page *page)
-{
-	struct rmap_walk_control rwc = {
-		.rmap_one = page_mlock_one,
-		.done = page_not_mapped,
-		.anon_lock = page_lock_anon_vma_read,
-
-	};
-
-	VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
-
-	/* Anon THP are only marked as mlocked when singly mapped */
-	if (PageTransCompound(page) && PageAnon(page))
-		return;
-
-	rmap_walk(page, &rwc);
-}
-
 #ifdef CONFIG_DEVICE_PRIVATE
 struct make_exclusive_args {
 	struct mm_struct *mm;
@@ -2291,11 +2221,6 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the anon_vma struct it points to.
- *
- * When called from page_mlock(), the mmap_lock of the mm containing the vma
- * where the page was found will be held for write.  So, we won't recheck
- * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
- * LOCKED.
  */
 static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		bool locked)
@@ -2344,11 +2269,6 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
- *
- * When called from page_mlock(), the mmap_lock of the mm containing the vma
- * where the page was found will be held for write.  So, we won't recheck
- * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
- * LOCKED.
  */
 static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 		bool locked)
-- 
2.34.1



* [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
  2022-02-06 21:30 ` [PATCH 01/13] mm/munlock: delete page_mlock() and all its works Hugh Dickins
@ 2022-02-06 21:32 ` Hugh Dickins
  2022-02-10 11:35   ` Vlastimil Babka
  2022-02-06 21:34 ` [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap Hugh Dickins
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).
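
For reference, that existing tracking for new anon pages looks roughly
like this (paraphrased from 5.17's mm/swap.c, not an exact quote): the
mlock is counted at the point where the pte is mapped in.

	void lru_cache_add_inactive_or_unevictable(struct page *page,
						   struct vm_area_struct *vma)
	{
		bool unevictable;

		VM_BUG_ON_PAGE(PageLRU(page), page);

		/* Count the mlock here, while the pte is being mapped in */
		unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
		if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
			int nr_pages = thp_nr_pages(page);

			__mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
			count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
		}
		lru_cache_add(page);
	}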

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to faultin page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right?  I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before?  Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area?  I can see that being argued either
way, and have no reason to disagree with current behaviour.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h |  2 --
 mm/gup.c           | 43 ++++++++-----------------------------------
 mm/huge_memory.c   | 33 ---------------------------------
 3 files changed, 8 insertions(+), 70 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..74ee50c2033b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2925,13 +2925,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
 #define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
-#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
 #define FOLL_NOFAULT	0x80	/* do not fault in pages */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
-#define FOLL_MLOCK	0x1000	/* lock present pages */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..2076902344d8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -572,32 +572,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		 */
 		mark_page_accessed(page);
 	}
-	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
-		/* Do not mlock pte-mapped THP */
-		if (PageTransCompound(page))
-			goto out;
-
-		/*
-		 * The preliminary mapping check is mainly to avoid the
-		 * pointless overhead of lock_page on the ZERO_PAGE
-		 * which might bounce very badly if there is contention.
-		 *
-		 * If the page is already locked, we don't need to
-		 * handle it now - vmscan will handle it later if and
-		 * when it attempts to reclaim the page.
-		 */
-		if (page->mapping && trylock_page(page)) {
-			lru_add_drain();  /* push cached pages to LRU */
-			/*
-			 * Because we lock page here, and migration is
-			 * blocked by the pte's page reference, and we
-			 * know the page is still mapped, we don't even
-			 * need to check for file-cache page truncation.
-			 */
-			mlock_vma_page(page);
-			unlock_page(page);
-		}
-	}
 out:
 	pte_unmap_unlock(ptep, ptl);
 	return page;
@@ -920,9 +894,6 @@ static int faultin_page(struct vm_area_struct *vma,
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
 
-	/* mlock all present pages, but do not fault in new pages */
-	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
-		return -ENOENT;
 	if (*flags & FOLL_NOFAULT)
 		return -EFAULT;
 	if (*flags & FOLL_WRITE)
@@ -1173,8 +1144,6 @@ static long __get_user_pages(struct mm_struct *mm,
 			case -ENOMEM:
 			case -EHWPOISON:
 				goto out;
-			case -ENOENT:
-				goto next_page;
 			}
 			BUG();
 		} else if (PTR_ERR(page) == -EEXIST) {
@@ -1472,9 +1441,14 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
 	mmap_assert_locked(mm);
 
-	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
+	/*
+	 * Rightly or wrongly, the VM_LOCKONFAULT case has never used
+	 * faultin_page() to break COW, so it has no work to do here.
+	 */
 	if (vma->vm_flags & VM_LOCKONFAULT)
-		gup_flags &= ~FOLL_POPULATE;
+		return nr_pages;
+
+	gup_flags = FOLL_TOUCH;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
@@ -1541,10 +1515,9 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
 	 *	       in the page table.
 	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
 	 *		  a poisoned page.
-	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
 	 * !FOLL_FORCE: Require proper access permissions.
 	 */
-	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	gup_flags = FOLL_TOUCH | FOLL_HWPOISON;
 	if (write)
 		gup_flags |= FOLL_WRITE;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..9a34b85ebcf8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1380,39 +1380,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
 
-	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
-		/*
-		 * We don't mlock() pte-mapped THPs. This way we can avoid
-		 * leaking mlocked pages into non-VM_LOCKED VMAs.
-		 *
-		 * For anon THP:
-		 *
-		 * In most cases the pmd is the only mapping of the page as we
-		 * break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
-		 * writable private mappings in populate_vma_page_range().
-		 *
-		 * The only scenario when we have the page shared here is if we
-		 * mlocking read-only mapping shared over fork(). We skip
-		 * mlocking such pages.
-		 *
-		 * For file THP:
-		 *
-		 * We can expect PageDoubleMap() to be stable under page lock:
-		 * for file pages we set it in page_add_file_rmap(), which
-		 * requires page to be locked.
-		 */
-
-		if (PageAnon(page) && compound_mapcount(page) != 1)
-			goto skip_mlock;
-		if (PageDoubleMap(page) || !page->mapping)
-			goto skip_mlock;
-		if (!trylock_page(page))
-			goto skip_mlock;
-		if (page->mapping && !PageDoubleMap(page))
-			mlock_vma_page(page);
-		unlock_page(page);
-	}
-skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
-- 
2.34.1



* [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
  2022-02-06 21:30 ` [PATCH 01/13] mm/munlock: delete page_mlock() and all its works Hugh Dickins
  2022-02-06 21:32 ` [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE Hugh Dickins
@ 2022-02-06 21:34 ` Hugh Dickins
  2022-02-10 15:30   ` Vlastimil Babka
  2022-02-06 21:36 ` [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page() Hugh Dickins
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

munlock_vma_pages_range() will still be required, when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock
count will be maintained in much the same way as when mapping in the
pte.  That removes the need for munlock_vma_pages_all() on mlocked
vmas when munmapping or exiting: eliminating the catastrophic
contention on i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().

And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?
No fear of that now, so permit the OOM reaper on mlocked vmas.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/internal.h | 16 ++--------------
 mm/madvise.c  |  5 +++++
 mm/mlock.c    |  4 ++--
 mm/mmap.c     | 32 ++------------------------------
 mm/oom_kill.c |  2 +-
 5 files changed, 12 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e48c486d5ddf..f235aa92e564 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -71,11 +71,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
-static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
-{
-	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
-}
-
 struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
@@ -398,12 +393,8 @@ extern long populate_vma_page_range(struct vm_area_struct *vma,
 extern long faultin_vma_page_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end,
 				   bool write, int *locked);
-extern void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end);
-static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
-{
-	munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
-}
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+			      unsigned long len);
 
 /*
  * must be called with vma's mmap_lock held for read or write, and page locked.
@@ -411,9 +402,6 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
 extern void mlock_vma_page(struct page *page);
 extern void munlock_vma_page(struct page *page);
 
-extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
-			      unsigned long len);
-
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
  * we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..ae35d72627ef 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -530,6 +530,11 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
 	tlb_end_vma(tlb, vma);
 }
 
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
+{
+	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
+}
+
 static long madvise_cold(struct vm_area_struct *vma,
 			struct vm_area_struct **prev,
 			unsigned long start_addr, unsigned long end_addr)
diff --git a/mm/mlock.c b/mm/mlock.c
index 544c18ce2c58..d148da934fe9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -137,8 +137,8 @@ void munlock_vma_page(struct page *page)
  * Returns with VM_LOCKED cleared.  Callers must be prepared to
  * deal with this.
  */
-void munlock_vma_pages_range(struct vm_area_struct *vma,
-			     unsigned long start, unsigned long end)
+static void munlock_vma_pages_range(struct vm_area_struct *vma,
+				    unsigned long start, unsigned long end)
 {
 	/* Reimplementation to follow in later commit */
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 1e8fdb0b51ed..64b5985b5295 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2674,6 +2674,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	vma->vm_prev = NULL;
 	do {
 		vma_rb_erase(vma, &mm->mm_rb);
+		if (vma->vm_flags & VM_LOCKED)
+			mm->locked_vm -= vma_pages(vma);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
@@ -2778,22 +2780,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __split_vma(mm, vma, addr, new_below);
 }
 
-static inline void
-unlock_range(struct vm_area_struct *start, unsigned long limit)
-{
-	struct mm_struct *mm = start->vm_mm;
-	struct vm_area_struct *tmp = start;
-
-	while (tmp && tmp->vm_start < limit) {
-		if (tmp->vm_flags & VM_LOCKED) {
-			mm->locked_vm -= vma_pages(tmp);
-			munlock_vma_pages_all(tmp);
-		}
-
-		tmp = tmp->vm_next;
-	}
-}
-
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2874,12 +2860,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 			return error;
 	}
 
-	/*
-	 * unlock any mlock()ed ranges before detaching vmas
-	 */
-	if (mm->locked_vm)
-		unlock_range(vma, end);
-
 	/* Detach vmas from rbtree */
 	if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
 		downgrade = false;
@@ -3147,20 +3127,12 @@ void exit_mmap(struct mm_struct *mm)
 		 * Nothing can be holding mm->mmap_lock here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling unlock_range(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
 		 */
 		(void)__oom_reap_task_mm(mm);
-
 		set_bit(MMF_OOM_SKIP, &mm->flags);
 	}
 
 	mmap_write_lock(mm);
-	if (mm->locked_vm)
-		unlock_range(mm->mmap, ULONG_MAX);
-
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 832fb330376e..6b875acabd1e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -526,7 +526,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_lru_vma(vma))
+		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
 		/*
-- 
2.34.1



* [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (2 preceding siblings ...)
  2022-02-06 21:34 ` [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap Hugh Dickins
@ 2022-02-06 21:36 ` Hugh Dickins
  2022-02-11 10:29   ` Vlastimil Babka
  2022-02-06 21:38 ` [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance Hugh Dickins
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).
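
That is, the shape of the change is roughly (a sketch of the call
pattern only, not the exact diff):

	void page_add_file_rmap(struct page *page,
			struct vm_area_struct *vma, bool compound)
	{
		/* ... existing mapcount and statistics updates ... */
		mlock_vma_page(page, vma, compound);
	}

	void page_remove_rmap(struct page *page,
			struct vm_area_struct *vma, bool compound)
	{
		/* ... existing mapcount and statistics updates ... */
		munlock_vma_page(page, vma, compound);
	}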

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points.  Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's
overcautious, and inconsistent with other checks, given that
mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but I haven't
quite convinced myself to change it).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/rmap.h    | 17 +++++++------
 kernel/events/uprobes.c |  7 ++----
 mm/huge_memory.c        | 17 ++++++-------
 mm/hugetlb.c            |  4 +--
 mm/internal.h           | 36 ++++++++++++++++++++++----
 mm/khugepaged.c         |  4 +--
 mm/ksm.c                | 12 +--------
 mm/memory.c             | 45 +++++++++++----------------------
 mm/migrate.c            |  9 ++-----
 mm/mlock.c              | 21 ++++++----------
 mm/rmap.c               | 56 +++++++++++++++++++----------------------
 mm/userfaultfd.c        | 14 ++++++-----
 12 files changed, 113 insertions(+), 129 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index dc48aa8c2c94..ac29b076082b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page);
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long address, bool compound);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-			   unsigned long, int);
+		unsigned long address, int flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
-void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
-
+		unsigned long address, bool compound);
+void page_add_file_rmap(struct page *, struct vm_area_struct *,
+		bool compound);
+void page_remove_rmap(struct page *, struct vm_area_struct *,
+		bool compound);
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
-			    unsigned long);
+		unsigned long address);
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-				unsigned long);
+		unsigned long address);
 
 static inline void page_dup_rmap(struct page *page, bool compound)
 {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..eed2f7437d96 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 			return err;
 	}
 
-	/* For try_to_free_swap() and munlock_vma_page() below */
+	/* For try_to_free_swap() below */
 	lock_page(old_page);
 
 	mmu_notifier_invalidate_range_start(&range);
@@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
 
-	page_remove_rmap(old_page, false);
+	page_remove_rmap(old_page, vma, false);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
-
-	if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page))
-		munlock_vma_page(old_page);
 	put_page(old_page);
 
 	err = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a34b85ebcf8..d6477f48a27e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, vma, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else if (thp_migration_supported()) {
@@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				set_page_dirty(page);
 			if (!PageReferenced(page) && pmd_young(old_pmd))
 				SetPageReferenced(page);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, vma, true);
 			put_page(page);
 		}
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
@@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 		}
 		unlock_page_memcg(page);
+
+		/* Above is effectively page_remove_rmap(page, vma, true) */
+		munlock_vma_page(page, vma, true);
 	}
 
 	smp_wmb(); /* make pte visible before pmd */
@@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (freeze) {
 		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, false);
+			page_remove_rmap(page + i, vma, false);
 			put_page(page + i);
 		}
 	}
@@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 				do_unlock_page = true;
 			}
 		}
-		if (PageMlocked(page))
-			clear_page_mlock(page);
 	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
 		goto out;
 	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
@@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, vma, true);
 	put_page(page);
 }
 
@@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	if (PageAnon(new))
 		page_add_anon_rmap(new, vma, mmun_start, true);
 	else
-		page_add_file_rmap(new, true);
+		page_add_file_rmap(new, vma, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
-	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
-		mlock_vma_page(new);
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61895cc01d09..43fb3155298e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			set_page_dirty(page);
 
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, vma, true);
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Break COW */
 		huge_ptep_clear_flush(vma, haddr, ptep);
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
-		page_remove_rmap(old_page, true);
+		page_remove_rmap(old_page, vma, true);
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, new_page, 1));
diff --git a/mm/internal.h b/mm/internal.h
index f235aa92e564..3d7dfc8bc471 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
 				   bool write, int *locked);
 extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
 			      unsigned long len);
-
 /*
- * must be called with vma's mmap_lock held for read or write, and page locked.
+ * mlock_vma_page() and munlock_vma_page():
+ * should be called with vma's mmap_lock held for read or write,
+ * under page table lock for the pte/pmd being added or removed.
+ *
+ * mlock is usually called at the end of page_add_*_rmap(),
+ * munlock at the end of page_remove_rmap(); but new anon
+ * pages are managed in lru_cache_add_inactive_or_unevictable().
+ *
+ * @compound is used to include pmd mappings of THPs, but filter out
+ * pte mappings of THPs, which cannot be consistently counted: a pte
+ * mapping of the THP head cannot be distinguished by the page alone.
  */
-extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+void mlock_page(struct page *page);
+static inline void mlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound)
+{
+	if (unlikely(vma->vm_flags & VM_LOCKED) &&
+	    (compound || !PageTransCompound(page)))
+		mlock_page(page);
+}
+void munlock_page(struct page *page);
+static inline void munlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound)
+{
+	if (unlikely(vma->vm_flags & VM_LOCKED) &&
+	    (compound || !PageTransCompound(page)))
+		munlock_page(page);
+}
 
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
@@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 #else /* !CONFIG_MMU */
 static inline void unmap_mapping_folio(struct folio *folio) { }
 static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound) { }
+static inline void munlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..d5e387c58bde 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 */
 			spin_lock(ptl);
 			ptep_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page, false);
+			page_remove_rmap(src_page, vma, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (pte_none(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
-		page_remove_rmap(page, false);
+		page_remove_rmap(page, vma, false);
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..c5a4403b5dc9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
-	page_remove_rmap(page, false);
+	page_remove_rmap(page, vma, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
@@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 			err = replace_page(vma, page, kpage, orig_pte);
 	}
 
-	if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
-		munlock_vma_page(page);
-		if (!PageMlocked(kpage)) {
-			unlock_page(page);
-			lock_page(kpage);
-			mlock_vma_page(kpage);
-			page = kpage;		/* for final unlock */
-		}
-	}
-
 out_unlock:
 	unlock_page(page);
 out:
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..53bd9e5f2e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 
 	set_pte_at(vma->vm_mm, address, ptep, pte);
 
-	if (vma->vm_flags & VM_LOCKED)
-		mlock_vma_page(page);
-
 	/*
 	 * No need to invalidate - it was non-present before. However
 	 * secondary CPUs may have mappings that need invalidating.
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-
 			if (is_device_private_entry(entry))
-				page_remove_rmap(page, false);
-
+				page_remove_rmap(page, vma, false);
 			put_page(page);
 			continue;
 		}
@@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page)
 	return 0;
 }
 
-static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 			unsigned long addr, struct page *page, pgprot_t prot)
 {
 	if (!pte_none(*pte))
 		return -EBUSY;
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
-	inc_mm_counter_fast(mm, mm_counter_file(page));
-	page_add_file_rmap(page, false);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+	page_add_file_rmap(page, vma, false);
+	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
 	return 0;
 }
 
@@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 			struct page *page, pgprot_t prot)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	int retval;
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	if (retval)
 		goto out;
 	retval = -ENOMEM;
-	pte = get_locked_pte(mm, addr, &ptl);
+	pte = get_locked_pte(vma->vm_mm, addr, &ptl);
 	if (!pte)
 		goto out;
-	retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
+	retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
 	pte_unmap_unlock(pte, ptl);
 out:
 	return retval;
 }
 
 #ifdef pte_index
-static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
 			unsigned long addr, struct page *page, pgprot_t prot)
 {
 	int err;
@@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
 	err = validate_page_before_insert(page);
 	if (err)
 		return err;
-	return insert_page_into_pte_locked(mm, pte, addr, page, prot);
+	return insert_page_into_pte_locked(vma, pte, addr, page, prot);
 }
 
 /* insert_pages() amortizes the cost of spinlock operations
@@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 
 		start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
 		for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
-			int err = insert_page_in_batch_locked(mm, pte,
+			int err = insert_page_in_batch_locked(vma, pte,
 				addr, pages[curr_page_idx], prot);
 			if (unlikely(err)) {
 				pte_unmap_unlock(start_pte, pte_lock);
@@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page, false);
+			page_remove_rmap(old_page, vma, false);
 		}
 
 		/* Free the old page.. */
@@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	 */
 	mmu_notifier_invalidate_range_only_end(&range);
 	if (old_page) {
-		/*
-		 * Don't let another task, with possibly unlocked vma,
-		 * keep the mlocked page.
-		 */
-		if (page_copied && (vma->vm_flags & VM_LOCKED)) {
-			lock_page(old_page);	/* LRU manipulation */
-			if (PageMlocked(old_page))
-				munlock_vma_page(old_page);
-			unlock_page(old_page);
-		}
 		if (page_copied)
 			free_swap_cache(old_page);
 		put_page(old_page);
@@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
 	add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
-	page_add_file_rmap(page, true);
+	page_add_file_rmap(page, vma, true);
+
 	/*
 	 * deposit and withdraw with pmd lock held
 	 */
@@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page, false);
+		page_add_file_rmap(page, vma, false);
 	}
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..7c4223ce2500 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			if (PageAnon(new))
 				page_add_anon_rmap(new, vma, pvmw.address, false);
 			else
-				page_add_file_rmap(new, false);
+				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
-		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
-			mlock_vma_page(new);
-
-		if (PageTransHuge(page) && PageMlocked(page))
-			clear_page_mlock(page);
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * drop page refcount. Page won't be freed, as we took
 			 * a reference just above.
 			 */
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			put_page(page);
 
 			if (pte_present(pte))
diff --git a/mm/mlock.c b/mm/mlock.c
index d148da934fe9..aaded15b2f8f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page)
 	}
 }
 
-/*
- * Mark page as mlocked if not already.
- * If page on LRU, isolate and putback to move to unevictable list.
+/**
+ * mlock_page - mlock a page
+ * @page - page to be mlocked, either a normal page or a THP head.
  */
-void mlock_vma_page(struct page *page)
+void mlock_page(struct page *page)
 {
-	/* Serialize with page migration */
-	BUG_ON(!PageLocked(page));
-
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
 
 	if (!TestSetPageMlocked(page)) {
 		int nr_pages = thp_nr_pages(page);
@@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page)
 }
 
 /**
- * munlock_vma_page - munlock a vma page
- * @page: page to be unlocked, either a normal page or THP page head
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
  */
-void munlock_vma_page(struct page *page)
+void munlock_page(struct page *page)
 {
-	/* Serialize with page migration */
-	BUG_ON(!PageLocked(page));
-
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 7ce7f1946cff..6cc8bf129f18 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page,
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
-	if (unlikely(PageKsm(page))) {
+	if (unlikely(PageKsm(page)))
 		unlock_page_memcg(page);
-		return;
-	}
 
 	/* address might be in next vma when migration races vma_adjust */
-	if (first)
+	else if (first)
 		__page_set_anon_rmap(page, vma, address,
 				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
+
+	mlock_vma_page(page, vma, compound);
 }
 
 /**
@@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page,
 
 /**
  * page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
- * @compound: charge the page as compound or small page
+ * @page:	the page to add the mapping to
+ * @vma:	the vm area in which the mapping is added
+ * @compound:	charge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page, bool compound)
+void page_add_file_rmap(struct page *page,
+	struct vm_area_struct *vma, bool compound)
 {
 	int i, nr = 1;
 
@@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound)
 						nr_pages);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
-			struct page *head = compound_head(page);
-
 			VM_WARN_ON_ONCE(!PageLocked(page));
-
-			SetPageDoubleMap(head);
-			if (PageMlocked(page))
-				clear_page_mlock(head);
+			SetPageDoubleMap(compound_head(page));
 		}
 		if (!atomic_inc_and_test(&page->_mapcount))
 			goto out;
@@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound)
 	__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 out:
 	unlock_page_memcg(page);
+
+	mlock_vma_page(page, vma, compound);
 }
 
 static void page_remove_file_rmap(struct page *page, bool compound)
@@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
+ * @vma:	the vm area from which the mapping is removed
  * @compound:	uncharge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page,
+	struct vm_area_struct *vma, bool compound)
 {
 	lock_page_memcg(page);
 
@@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound)
 	 */
 out:
 	unlock_page_memcg(page);
+
+	munlock_vma_page(page, vma, compound);
 }
 
 /*
@@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
 		/*
-		 * If the page is mlock()d, we cannot swap it out.
+		 * If the page is in an mlock()d vma, we must not swap it out.
 		 */
 		if (!(flags & TTU_IGNORE_MLOCK) &&
 		    (vma->vm_flags & VM_LOCKED)) {
-			/*
-			 * PTE-mapped THP are never marked as mlocked: so do
-			 * not set it on a DoubleMap THP, nor on an Anon THP
-			 * (which may still be PTE-mapped after DoubleMap was
-			 * cleared).  But stop unmapping even in those cases.
-			 */
-			if (!PageTransCompound(page) || (PageHead(page) &&
-			     !PageDoubleMap(page) && !PageAnon(page)))
-				mlock_vma_page(page);
+			/* Restore the mlock which got missed */
+			mlock_vma_page(page, vma, false);
 			page_vma_mapped_walk_done(&pvmw);
 			ret = false;
 			break;
 		}
 
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_PAGE(!pvmw.pte, page);
-
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
@@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, vma, PageHuge(page));
 		put_page(page);
 	}
 
@@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, vma, PageHuge(page));
 		put_page(page);
 	}
 
@@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page,
 		 * There is a reference on the page for the swap entry which has
 		 * been removed, so shouldn't take another.
 		 */
-		page_remove_rmap(subpage, false);
+		page_remove_rmap(subpage, vma, false);
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..15d3e97a6e04 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
 
-	if (page_in_cache)
-		page_add_file_rmap(page, false);
-	else
+	if (page_in_cache) {
+		/* Usually, cache pages are already added to LRU */
+		if (newly_allocated)
+			lru_cache_add(page);
+		page_add_file_rmap(page, dst_vma, false);
+	} else {
 		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+		lru_cache_add_inactive_or_unevictable(page, dst_vma);
+	}
 
 	/*
 	 * Must happen after rmap, as mm_counter() checks mapping (via
@@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	 */
 	inc_mm_counter(dst_mm, mm_counter(page));
 
-	if (newly_allocated)
-		lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	/* No need to invalidate - it was non-present before */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (3 preceding siblings ...)
  2022-02-06 21:36 ` [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page() Hugh Dickins
@ 2022-02-06 21:38 ` Hugh Dickins
  2022-02-11 11:42   ` Vlastimil Babka
  2022-02-06 21:38 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Matthew Wilcox
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
of the munlocking to clear_page_mlock(), since PageMlocked is typically
still set when mapcount has fallen to 0.  That is not what we want: we
want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
on the integrity of the mlock/munlock protocol - small numbers are
not surprising, but big numbers mean the protocol is not working.

That could be easily fixed by placing munlock_vma_page() at the start of
page_remove_rmap(); but later in the series we shall want to batch the
munlocking, and that too would tend to leave PageMlocked still set at
the point when it is checked.

So delete clear_page_mlock() now: leave it instead to release_pages()
(and __page_cache_release()) to do this backstop clearing of Mlocked,
when page refcount has fallen to 0.  If a pinned page occasionally gets
counted as Mlocked and Unevictable until it is unpinned, that's okay.

A slightly regrettable side-effect of this change is that, since
release_pages() and __page_cache_release() may be called at interrupt
time, those places which update NR_MLOCK with interrupts enabled
had better use mod_zone_page_state() rather than __mod_zone_page_state()
(but holding the lruvec lock always has interrupts disabled).

This change, forcing Mlocked off when refcount 0 instead of earlier
when mapcount 0, is not fundamental: it can be reversed if performance
or something else is found to suffer; but this is the easiest way to
separate the stats - let's not complicate that without good reason.
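
A condensed sketch of that backstop, taken from the mm/swap.c hunks
below (release_pages() shown here; the __page_cache_release() hunk is
the compound-page analogue, using mod_zone_page_state()):

	/* Mlocked is now only cleared once the final reference is gone */
	if (unlikely(PageMlocked(page))) {
		__ClearPageMlocked(page);
		dec_zone_page_state(page, NR_MLOCK);
		count_vm_event(UNEVICTABLE_PGCLEARED);
	}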

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/internal.h | 12 ------------
 mm/mlock.c    | 30 ------------------------------
 mm/rmap.c     |  9 ---------
 mm/swap.c     | 32 ++++++++++++++++++++++++--------
 4 files changed, 24 insertions(+), 59 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3d7dfc8bc471..a43d79335c16 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -425,17 +425,6 @@ static inline void munlock_vma_page(struct page *page,
 		munlock_page(page);
 }
 
-/*
- * Clear the page's PageMlocked().  This can be useful in a situation where
- * we want to unconditionally remove a page from the pagecache -- e.g.,
- * on truncation or freeing.
- *
- * It is legal to call this function for any page, mlocked or not.
- * If called for a page that is still mapped by mlocked vmas, all we do
- * is revert to lazy LRU behaviour -- semantics are not broken.
- */
-extern void clear_page_mlock(struct page *page);
-
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /*
@@ -509,7 +498,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 }
 #else /* !CONFIG_MMU */
 static inline void unmap_mapping_folio(struct folio *folio) { }
-static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
 static inline void munlock_vma_page(struct page *page,
diff --git a/mm/mlock.c b/mm/mlock.c
index aaded15b2f8f..db936288b8a0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -48,36 +48,6 @@ EXPORT_SYMBOL(can_do_mlock);
  * PageUnevictable is set to indicate the unevictable state.
  */
 
-/*
- *  LRU accounting for clear_page_mlock()
- */
-void clear_page_mlock(struct page *page)
-{
-	int nr_pages;
-
-	if (!TestClearPageMlocked(page))
-		return;
-
-	nr_pages = thp_nr_pages(page);
-	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-	count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
-	/*
-	 * The previous TestClearPageMlocked() corresponds to the smp_mb()
-	 * in __pagevec_lru_add_fn().
-	 *
-	 * See __pagevec_lru_add_fn for more explanation.
-	 */
-	if (!isolate_lru_page(page)) {
-		putback_lru_page(page);
-	} else {
-		/*
-		 * We lost the race. the page already moved to evictable list.
-		 */
-		if (PageUnevictable(page))
-			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
-	}
-}
-
 /**
  * mlock_page - mlock a page
  * @page - page to be mlocked, either a normal page or a THP head.
diff --git a/mm/rmap.c b/mm/rmap.c
index 6cc8bf129f18..5442a5c97a85 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1315,9 +1315,6 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
 	__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
-
-	if (unlikely(PageMlocked(page)))
-		clear_page_mlock(page);
 }
 
 static void page_remove_anon_compound_rmap(struct page *page)
@@ -1357,9 +1354,6 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		nr = thp_nr_pages(page);
 	}
 
-	if (unlikely(PageMlocked(page)))
-		clear_page_mlock(page);
-
 	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
 }
@@ -1398,9 +1392,6 @@ void page_remove_rmap(struct page *page,
 	 */
 	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
 
-	if (unlikely(PageMlocked(page)))
-		clear_page_mlock(page);
-
 	if (PageTransCompound(page))
 		deferred_split_huge_page(compound_head(page));
 
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..ff4810e4a4bc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -74,8 +74,8 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 };
 
 /*
- * This path almost never happens for VM activity - pages are normally
- * freed via pagevecs.  But it gets used by networking.
+ * This path almost never happens for VM activity - pages are normally freed
+ * via pagevecs.  But it gets used by networking - and for compound pages.
  */
 static void __page_cache_release(struct page *page)
 {
@@ -89,6 +89,14 @@ static void __page_cache_release(struct page *page)
 		__clear_page_lru_flags(page);
 		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
+	/* See comment on PageMlocked in release_pages() */
+	if (unlikely(PageMlocked(page))) {
+		int nr_pages = thp_nr_pages(page);
+
+		__ClearPageMlocked(page);
+		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+		count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
+	}
 	__ClearPageWaiters(page);
 }
 
@@ -489,12 +497,8 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
 	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
 	if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
 		int nr_pages = thp_nr_pages(page);
-		/*
-		 * We use the irq-unsafe __mod_zone_page_state because this
-		 * counter is not modified from interrupt context, and the pte
-		 * lock is held(spinlock), which implies preemption disabled.
-		 */
-		__mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+
+		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
 		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 	}
 	lru_cache_add(page);
@@ -969,6 +973,18 @@ void release_pages(struct page **pages, int nr)
 			__clear_page_lru_flags(page);
 		}
 
+		/*
+		 * In rare cases, when truncation or holepunching raced with
+		 * munlock after VM_LOCKED was cleared, Mlocked may still be
+		 * found set here.  This does not indicate a problem, unless
+		 * "unevictable_pgs_cleared" appears worryingly large.
+		 */
+		if (unlikely(PageMlocked(page))) {
+			__ClearPageMlocked(page);
+			dec_zone_page_state(page, NR_MLOCK);
+			count_vm_event(UNEVICTABLE_PGCLEARED);
+		}
+
 		__ClearPageWaiters(page);
 
 		list_add(&page->lru, &pages_to_free);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (4 preceding siblings ...)
  2022-02-06 21:38 ` [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance Hugh Dickins
@ 2022-02-06 21:38 ` Matthew Wilcox
  2022-02-07 18:20   ` Hugh Dickins
  2022-02-06 21:40 ` [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable Hugh Dickins
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Matthew Wilcox @ 2022-02-06 21:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On Sun, Feb 06, 2022 at 01:27:41PM -0800, Hugh Dickins wrote:
> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.

Most of this patchset hasn't arrived here yet, but I would be
_delighted_ to rebase the folio conversion on top of this.  What
convoluted code it is!  I am so glad you've cleaned this up; I was
dreading doing the rest of the mlock file.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (5 preceding siblings ...)
  2022-02-06 21:38 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Matthew Wilcox
@ 2022-02-06 21:40 ` Hugh Dickins
  2022-02-11 12:27   ` Vlastimil Babka
  2022-02-11 18:07   ` Vlastimil Babka
  2022-02-06 21:42 ` [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking Hugh Dickins
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Previous patches have been preparatory: now implement page->mlock_count.
The ordering of the "Unevictable LRU" is of no significance, and there is
no point holding unevictable pages on a list: place page->mlock_count to
overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
which needs to be even so as not to satisfy PageTail - though 2 could be
added instead of 1 for each mlock, if that's ever an improvement).

But it's only safe to rely on or modify page->mlock_count while lruvec
lock is held and page is on unevictable "LRU" - we can save lots of edits
by continuing to pretend that there's an imaginary LRU here (there is an
unevictable count which still needs to be maintained, but not a list).

The mlock_count technique suffers from an unreliability much like with
page_mlock(): while someone else has the page off LRU, not much can
be done.  As before, err on the safe side (behave as if mlock_count 0),
and let try_to_unlock_one() move the page to unevictable if reclaim finds
out later on - a few misplaced pages don't matter, what we want to avoid
is imbalancing reclaim by flooding evictable lists with unevictable pages.

I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
if we have taken lruvec lock to get the page off its present list, then
we save everyone trouble (and however many extra atomic ops) by putting
it on its destination list immediately.
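
For reference, this is the overlay as it lands in struct page (see the
mm_types.h hunk below; struct folio gains the same union): mlock_count
reuses the lru.prev word, while __filler keeps the word overlaying
compound_head even, so an unevictable page never looks like PageTail.

	union {
		struct list_head lru;
		/* Or, for the Unevictable "LRU list" slot */
		struct {
			/* Always even, to negate PageTail */
			void *__filler;
			/* Count page's or folio's mlocks */
			unsigned int mlock_count;
		};
	};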

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm_inline.h | 11 +++++--
 include/linux/mm_types.h  | 19 +++++++++--
 mm/huge_memory.c          |  5 ++-
 mm/memcontrol.c           |  3 +-
 mm/mlock.c                | 68 +++++++++++++++++++++++++++++++--------
 mm/mmzone.c               |  7 ++++
 mm/swap.c                 |  1 +
 7 files changed, 92 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index b725839dfe71..884d6f6af05b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -99,7 +99,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
 
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
-	list_add(&folio->lru, &lruvec->lists[lru]);
+	if (lru != LRU_UNEVICTABLE)
+		list_add(&folio->lru, &lruvec->lists[lru]);
 }
 
 static __always_inline void add_page_to_lru_list(struct page *page,
@@ -115,6 +116,7 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
+	/* This is not expected to be used on LRU_UNEVICTABLE */
 	list_add_tail(&folio->lru, &lruvec->lists[lru]);
 }
 
@@ -127,8 +129,11 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 static __always_inline
 void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
-	list_del(&folio->lru);
-	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
+	enum lru_list lru = folio_lru_list(folio);
+
+	if (lru != LRU_UNEVICTABLE)
+		list_del(&folio->lru);
+	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			-folio_nr_pages(folio));
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5140e5feb486..475bdb282769 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -85,7 +85,16 @@ struct page {
 			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
-			struct list_head lru;
+			union {
+				struct list_head lru;
+				/* Or, for the Unevictable "LRU list" slot */
+				struct {
+					/* Always even, to negate PageTail */
+					void *__filler;
+					/* Count page's or folio's mlocks */
+					unsigned int mlock_count;
+				};
+			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
 			pgoff_t index;		/* Our offset within mapping. */
@@ -241,7 +250,13 @@ struct folio {
 		struct {
 	/* public: */
 			unsigned long flags;
-			struct list_head lru;
+			union {
+				struct list_head lru;
+				struct {
+					void *__filler;
+					unsigned int mlock_count;
+				};
+			};
 			struct address_space *mapping;
 			pgoff_t index;
 			void *private;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6477f48a27e..9afca0122723 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2300,8 +2300,11 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	} else {
 		/* head is still on lru (and we have it frozen) */
 		VM_WARN_ON(!PageLRU(head));
+		if (PageUnevictable(tail))
+			tail->mlock_count = 0;
+		else
+			list_add_tail(&tail->lru, &head->lru);
 		SetPageLRU(tail);
-		list_add_tail(&tail->lru, &head->lru);
 	}
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..b10590926177 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1257,8 +1257,7 @@ struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
  * @nr_pages: positive when adding or negative when removing
  *
  * This function must be called under lru_lock, just before a page is added
- * to or just after a page is removed from an lru list (that ordering being
- * so as to allow it to check that lru_size 0 is consistent with list_empty).
+ * to or just after a page is removed from an lru list.
  */
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 				int zid, int nr_pages)
diff --git a/mm/mlock.c b/mm/mlock.c
index db936288b8a0..0d3ae04b1f4e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -54,16 +54,35 @@ EXPORT_SYMBOL(can_do_mlock);
  */
 void mlock_page(struct page *page)
 {
+	struct lruvec *lruvec;
+	int nr_pages = thp_nr_pages(page);
+
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	if (!TestSetPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
 		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
-		if (!isolate_lru_page(page))
-			putback_lru_page(page);
+		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	}
+
+	/* There is nothing more we can do while it's off LRU */
+	if (!TestClearPageLRU(page))
+		return;
+
+	lruvec = folio_lruvec_lock_irq(page_folio(page));
+	if (PageUnevictable(page)) {
+		page->mlock_count++;
+		goto out;
 	}
+
+	del_page_from_lru_list(page, lruvec);
+	ClearPageActive(page);
+	SetPageUnevictable(page);
+	page->mlock_count = 1;
+	add_page_to_lru_list(page, lruvec);
+	__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
+out:
+	SetPageLRU(page);
+	unlock_page_lruvec_irq(lruvec);
 }
 
 /**
@@ -72,19 +91,40 @@ void mlock_page(struct page *page)
  */
 void munlock_page(struct page *page)
 {
+	struct lruvec *lruvec;
+	int nr_pages = thp_nr_pages(page);
+
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
+	lock_page_memcg(page);
+	lruvec = folio_lruvec_lock_irq(page_folio(page));
+	if (PageLRU(page) && PageUnevictable(page)) {
+		/* Then mlock_count is maintained, but might undercount */
+		if (page->mlock_count)
+			page->mlock_count--;
+		if (page->mlock_count)
+			goto out;
+	}
+	/* else assume that was the last mlock: reclaim will fix it if not */
+
 	if (TestClearPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
-		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		if (!isolate_lru_page(page)) {
-			putback_lru_page(page);
-			count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
-		} else if (PageUnevictable(page)) {
-			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
-		}
+		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+		if (PageLRU(page) || !PageUnevictable(page))
+			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
+		else
+			__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
+	}
+
+	/* page_evictable() has to be checked *after* clearing Mlocked */
+	if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
+		del_page_from_lru_list(page, lruvec);
+		ClearPageUnevictable(page);
+		add_page_to_lru_list(page, lruvec);
+		__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
 	}
+out:
+	unlock_page_lruvec_irq(lruvec);
+	unlock_page_memcg(page);
 }
 
 /*
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..40e1d9428300 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -81,6 +81,13 @@ void lruvec_init(struct lruvec *lruvec)
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
+	/*
+	 * The "Unevictable LRU" is imaginary: though its size is maintained,
+	 * it is never scanned, and unevictable pages are not threaded on it
+	 * (so that their lru fields can be reused to hold mlock_count).
+	 * Poison its list head, so that any operations on it would crash.
+	 */
+	list_del(&lruvec->lists[LRU_UNEVICTABLE]);
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index ff4810e4a4bc..682a03301a2c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1062,6 +1062,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	} else {
 		folio_clear_active(folio);
 		folio_set_unevictable(folio);
+		folio->mlock_count = !!folio_test_mlocked(folio);
 		if (!was_unevictable)
 			__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (6 preceding siblings ...)
  2022-02-06 21:40 ` [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable Hugh Dickins
@ 2022-02-06 21:42 ` Hugh Dickins
  2022-02-11 16:45   ` Vlastimil Babka
  2022-02-06 21:43 ` [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU Hugh Dickins
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range().  This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() mlock it a second time.  This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.
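
In outline, the guard looks like this (full versions in the mm/internal.h
and mm/mlock.c hunks below): mlock_vma_page() declines while VM_IO is set -

	/* VM_IO check prevents migration from double-counting during mlock */
	if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
	    (compound || !PageTransCompound(page)))
		mlock_page(page);

and mlock_vma_pages_range(), holding mmap_lock for write, sets VM_IO
alongside VM_LOCKED just for the duration of the page walk, clearing
it again afterwards.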

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/internal.h |   3 +-
 mm/mlock.c    | 108 ++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 90 insertions(+), 21 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a43d79335c16..b3f0dd3ffba2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,7 +412,8 @@ void mlock_page(struct page *page);
 static inline void mlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound)
 {
-	if (unlikely(vma->vm_flags & VM_LOCKED) &&
+	/* VM_IO check prevents migration from double-counting during mlock */
+	if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
 	    (compound || !PageTransCompound(page)))
 		mlock_page(page);
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index 0d3ae04b1f4e..f8e5dcff21ae 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -14,6 +14,7 @@
 #include <linux/swapops.h>
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
+#include <linux/pagewalk.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
@@ -127,23 +128,90 @@ void munlock_page(struct page *page)
 	unlock_page_memcg(page);
 }
 
+static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
+			   unsigned long end, struct mm_walk *walk)
+
+{
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+	pte_t *start_pte, *pte;
+	struct page *page;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (!pmd_present(*pmd))
+			goto out;
+		if (is_huge_zero_pmd(*pmd))
+			goto out;
+		page = pmd_page(*pmd);
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page(page);
+		else
+			munlock_page(page);
+		goto out;
+	}
+
+	start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte))
+			continue;
+		page = vm_normal_page(vma, addr, *pte);
+		if (!page)
+			continue;
+		if (PageTransCompound(page))
+			continue;
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page(page);
+		else
+			munlock_page(page);
+	}
+	pte_unmap(start_pte);
+out:
+	spin_unlock(ptl);
+	cond_resched();
+	return 0;
+}
+
 /*
- * munlock_vma_pages_range() - munlock all pages in the vma range.'
- * @vma - vma containing range to be munlock()ed.
+ * mlock_vma_pages_range() - mlock any pages already in the range,
+ *                           or munlock all pages in the range.
+ * @vma - vma containing range to be mlock()ed or munlock()ed
  * @start - start address in @vma of the range
- * @end - end of range in @vma.
- *
- *  For mremap(), munmap() and exit().
+ * @end - end of range in @vma
+ * @newflags - the new set of flags for @vma.
  *
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared.  Callers must be prepared to
- * deal with this.
+ * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
+ * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
  */
-static void munlock_vma_pages_range(struct vm_area_struct *vma,
-				    unsigned long start, unsigned long end)
+static void mlock_vma_pages_range(struct vm_area_struct *vma,
+	unsigned long start, unsigned long end, vm_flags_t newflags)
 {
-	/* Reimplementation to follow in later commit */
+	static const struct mm_walk_ops mlock_walk_ops = {
+		.pmd_entry = mlock_pte_range,
+	};
+
+	/*
+	 * There is a slight chance that concurrent page migration,
+	 * or page reclaim finding a page of this now-VM_LOCKED vma,
+	 * will call mlock_vma_page() and raise page's mlock_count:
+	 * double counting, leaving the page unevictable indefinitely.
+	 * Communicate this danger to mlock_vma_page() with VM_IO,
+	 * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
+	 * mmap_lock is held in write mode here, so this weird
+	 * combination should not be visible to others.
+	 */
+	if (newflags & VM_LOCKED)
+		newflags |= VM_IO;
+	WRITE_ONCE(vma->vm_flags, newflags);
+
+	lru_add_drain();
+	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+	lru_add_drain();
+
+	if (newflags & VM_IO) {
+		newflags &= ~VM_IO;
+		WRITE_ONCE(vma->vm_flags, newflags);
+	}
 }
 
 /*
@@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	pgoff_t pgoff;
 	int nr_pages;
 	int ret = 0;
-	int lock = !!(newflags & VM_LOCKED);
-	vm_flags_t old_flags = vma->vm_flags;
+	vm_flags_t oldflags = vma->vm_flags;
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
@@ -197,9 +264,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * Keep track of amount of locked VM.
 	 */
 	nr_pages = (end - start) >> PAGE_SHIFT;
-	if (!lock)
+	if (!(newflags & VM_LOCKED))
 		nr_pages = -nr_pages;
-	else if (old_flags & VM_LOCKED)
+	else if (oldflags & VM_LOCKED)
 		nr_pages = 0;
 	mm->locked_vm += nr_pages;
 
@@ -209,11 +276,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
 
-	if (lock)
+	if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
+		/* No work to do, and mlocking twice would be wrong */
 		vma->vm_flags = newflags;
-	else
-		munlock_vma_pages_range(vma, start, end);
-
+	} else {
+		mlock_vma_pages_range(vma, start, end, newflags);
+	}
 out:
 	*prev = vma;
 	return ret;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (7 preceding siblings ...)
  2022-02-06 21:42 ` [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking Hugh Dickins
@ 2022-02-06 21:43 ` Hugh Dickins
  2022-02-11 17:19   ` Vlastimil Babka
  2022-02-06 21:45 ` [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn() Hugh Dickins
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Compaction, NUMA page movement, THP collapse/split, and memory failure
do isolate unevictable pages from their "LRU", losing the record of
mlock_count in doing so (isolators are likely to use page->lru for their
own private lists, so mlock_count has to be presumed lost).

That's unfortunate, and we should put in some work to correct that: one
can imagine a function to build up the mlock_count again - but it would
require i_mmap_rwsem for read, so be careful where it's called.  Or
page_referenced_one() and try_to_unmap_one() might do that extra work.

But one place that can very easily be improved is page migration's
__unmap_and_move(): a small adjustment to where the successful new page
is put back on LRU, and its mlock_count (if any) is built back up by
remove_migration_ptes().
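
Condensed from the mm/migrate.c hunk below, the adjustment is simply to
add the successful newpage to LRU before remove_migration_ptes() runs,
so that any mlock_vma_page() done there finds the page on LRU and can
build up newpage->mlock_count:

	if (rc == MIGRATEPAGE_SUCCESS) {
		lru_cache_add(newpage);
		if (page_was_mapped)
			lru_add_drain();
	}

	if (page_was_mapped)
		remove_migration_ptes(page,
			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);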

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/migrate.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 7c4223ce2500..f4bcf1541b62 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1032,6 +1032,21 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page, mode);
 
+	/*
+	 * When successful, push newpage to LRU immediately: so that if it
+	 * turns out to be an mlocked page, remove_migration_ptes() will
+	 * automatically build up the correct newpage->mlock_count for it.
+	 *
+	 * We would like to do something similar for the old page, when
+	 * unsuccessful, and other cases when a page has been temporarily
+	 * isolated from the unevictable LRU: but this case is the easiest.
+	 */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		lru_cache_add(newpage);
+		if (page_was_mapped)
+			lru_add_drain();
+	}
+
 	if (page_was_mapped)
 		remove_migration_ptes(page,
 			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
@@ -1045,20 +1060,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	unlock_page(page);
 out:
 	/*
-	 * If migration is successful, decrease refcount of the newpage
+	 * If migration is successful, decrease refcount of the newpage,
 	 * which will not free the page because new page owner increased
-	 * refcounter. As well, if it is LRU page, add the page to LRU
-	 * list in here. Use the old state of the isolated source page to
-	 * determine if we migrated a LRU page. newpage was already unlocked
-	 * and possibly modified by its owner - don't rely on the page
-	 * state.
+	 * refcounter.
 	 */
-	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(!is_lru))
-			put_page(newpage);
-		else
-			putback_lru_page(newpage);
-	}
+	if (rc == MIGRATEPAGE_SUCCESS)
+		put_page(newpage);
 
 	return rc;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (8 preceding siblings ...)
  2022-02-06 21:43 ` [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU Hugh Dickins
@ 2022-02-06 21:45 ` Hugh Dickins
  2022-02-11 17:43   ` Vlastimil Babka
  2022-02-06 21:47 ` [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec Hugh Dickins
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

My reading of the comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
says that it can now be deleted; and that remains so when the next patch
is added.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/swap.c | 37 +++++++++----------------------------
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 682a03301a2c..3f770b1ea2c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
+	folio_set_lru(folio);
 	/*
-	 * A folio becomes evictable in two ways:
-	 * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
-	 * 2) Before acquiring LRU lock to put the folio on the correct LRU
-	 *    and then
-	 *   a) do PageLRU check with lock [check_move_unevictable_pages]
-	 *   b) do PageLRU check before lock [clear_page_mlock]
-	 *
-	 * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
-	 * following strict ordering:
-	 *
-	 * #0: __pagevec_lru_add_fn		#1: clear_page_mlock
-	 *
-	 * folio_set_lru()			folio_test_clear_mlocked()
-	 * smp_mb() // explicit ordering	// above provides strict
-	 *					// ordering
-	 * folio_test_mlocked()			folio_test_lru()
+	 * Is an smp_mb__after_atomic() still required here, before
+	 * folio_evictable() tests PageMlocked, to rule out the possibility
+	 * of stranding an evictable folio on an unevictable LRU?  I think
+	 * not, because munlock_page() only clears PageMlocked while the LRU
+	 * lock is held.
 	 *
-	 *
-	 * if '#1' does not observe setting of PG_lru by '#0' and
-	 * fails isolation, the explicit barrier will make sure that
-	 * folio_evictable check will put the folio on the correct
-	 * LRU. Without smp_mb(), folio_set_lru() can be reordered
-	 * after folio_test_mlocked() check and can make '#1' fail the
-	 * isolation of the folio whose mlocked bit is cleared (#0 is
-	 * also looking at the same folio) and the evictable folio will
-	 * be stranded on an unevictable LRU.
+	 * (That is not true of __page_cache_release(), and not necessarily
+	 * true of release_pages(): but those only clear PageMlocked after
+	 * put_page_testzero() has excluded any other users of the page.)
 	 */
-	folio_set_lru(folio);
-	smp_mb__after_atomic();
-
 	if (folio_evictable(folio)) {
 		if (was_unevictable)
 			__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (9 preceding siblings ...)
  2022-02-06 21:45 ` [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn() Hugh Dickins
@ 2022-02-06 21:47 ` Hugh Dickins
  2022-02-09  8:15   ` Geert Uytterhoeven
  2022-02-11 18:26   ` Vlastimil Babka
  2022-02-06 21:49 ` [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained Hugh Dickins
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock.  That is not an overhead we would allow on
normal pages, but I think acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on a per-cpu pagevec
before acquiring the lruvec lock.

I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is subject to races, and awkward:
0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
this commit, which just widens the window?  I haven't gone back to think
it through.  Maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.

Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_pages() calls already contain a VM_BUG_ON_PGFLAGS()
for that.

This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since they sit on pagevec waiting to be processed long after page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).
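
The three operations share one per-cpu pagevec by tagging the low bits
of each page pointer, as in the mm/mlock.c hunk below:

	#define LRU_PAGE 0x1
	#define NEW_PAGE 0x2
	#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
	#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))

mlock_pagevec() strips the tag off again, and dispatches each page to
__mlock_page(), __mlock_new_page() or __munlock_page() accordingly.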

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/internal.h |   7 +-
 mm/mlock.c    | 212 ++++++++++++++++++++++++++++++++++++++++++--------
 mm/swap.c     |  27 ++++---
 3 files changed, 199 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3f0dd3ffba2..5817cabb6343 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
  *
  * mlock is usually called at the end of page_add_*_rmap(),
  * munlock at the end of page_remove_rmap(); but new anon
- * pages are managed in lru_cache_add_inactive_or_unevictable().
+ * pages are managed by lru_cache_add_inactive_or_unevictable()
+ * calling mlock_new_page().
  *
  * @compound is used to include pmd mappings of THPs, but filter out
  * pte mappings of THPs, which cannot be consistently counted: a pte
@@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page,
 	    (compound || !PageTransCompound(page)))
 		munlock_page(page);
 }
+void mlock_new_page(struct page *page);
+bool need_mlock_page_drain(int cpu);
+void mlock_page_drain(int cpu);
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
@@ -503,6 +507,7 @@ static inline void mlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
 static inline void munlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
+static inline void mlock_new_page(struct page *page) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index f8e5dcff21ae..d50d48961b22 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -28,6 +28,8 @@
 
 #include "internal.h"
 
+static DEFINE_PER_CPU(struct pagevec, mlock_pvec);
+
 bool can_do_mlock(void)
 {
 	if (rlimit(RLIMIT_MEMLOCK) != 0)
@@ -49,57 +51,79 @@ EXPORT_SYMBOL(can_do_mlock);
  * PageUnevictable is set to indicate the unevictable state.
  */
 
-/**
- * mlock_page - mlock a page
- * @page - page to be mlocked, either a normal page or a THP head.
- */
-void mlock_page(struct page *page)
+static struct lruvec *__mlock_page(struct page *page, struct lruvec *lruvec)
 {
-	struct lruvec *lruvec;
-	int nr_pages = thp_nr_pages(page);
+	/* There is nothing more we can do while it's off LRU */
+	if (!TestClearPageLRU(page))
+		return lruvec;
 
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
 
-	if (!TestSetPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	if (unlikely(page_evictable(page))) {
+		/*
+		 * This is a little surprising, but quite possible:
+		 * PageMlocked must have got cleared already by another CPU.
+		 * Could this page be on the Unevictable LRU?  I'm not sure,
+		 * but move it now if so.
+		 */
+		if (PageUnevictable(page)) {
+			del_page_from_lru_list(page, lruvec);
+			ClearPageUnevictable(page);
+			add_page_to_lru_list(page, lruvec);
+			__count_vm_events(UNEVICTABLE_PGRESCUED,
+					  thp_nr_pages(page));
+		}
+		goto out;
 	}
 
-	/* There is nothing more we can do while it's off LRU */
-	if (!TestClearPageLRU(page))
-		return;
-
-	lruvec = folio_lruvec_lock_irq(page_folio(page));
 	if (PageUnevictable(page)) {
-		page->mlock_count++;
+		if (PageMlocked(page))
+			page->mlock_count++;
 		goto out;
 	}
 
 	del_page_from_lru_list(page, lruvec);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
-	page->mlock_count = 1;
+	page->mlock_count = !!PageMlocked(page);
 	add_page_to_lru_list(page, lruvec);
-	__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
+	__count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
 out:
 	SetPageLRU(page);
-	unlock_page_lruvec_irq(lruvec);
+	return lruvec;
 }
 
-/**
- * munlock_page - munlock a page
- * @page: page to be munlocked, either a normal page or a THP head.
- */
-void munlock_page(struct page *page)
+static struct lruvec *__mlock_new_page(struct page *page, struct lruvec *lruvec)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
+
+	/* As above, this is a little surprising, but possible */
+	if (unlikely(page_evictable(page)))
+		goto out;
+
+	SetPageUnevictable(page);
+	page->mlock_count = !!PageMlocked(page);
+	__count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
+out:
+	add_page_to_lru_list(page, lruvec);
+	SetPageLRU(page);
+	return lruvec;
+}
+
+static struct lruvec *__munlock_page(struct page *page, struct lruvec *lruvec)
 {
-	struct lruvec *lruvec;
 	int nr_pages = thp_nr_pages(page);
+	bool isolated = false;
+
+	if (!TestClearPageLRU(page))
+		goto munlock;
 
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	isolated = true;
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
 
-	lock_page_memcg(page);
-	lruvec = folio_lruvec_lock_irq(page_folio(page));
-	if (PageLRU(page) && PageUnevictable(page)) {
+	if (PageUnevictable(page)) {
 		/* Then mlock_count is maintained, but might undercount */
 		if (page->mlock_count)
 			page->mlock_count--;
@@ -108,24 +132,144 @@ void munlock_page(struct page *page)
 	}
 	/* else assume that was the last mlock: reclaim will fix it if not */
 
+munlock:
 	if (TestClearPageMlocked(page)) {
 		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		if (PageLRU(page) || !PageUnevictable(page))
+		if (isolated || !PageUnevictable(page))
 			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
 		else
 			__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 	}
 
 	/* page_evictable() has to be checked *after* clearing Mlocked */
-	if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
+	if (isolated && PageUnevictable(page) && page_evictable(page)) {
 		del_page_from_lru_list(page, lruvec);
 		ClearPageUnevictable(page);
 		add_page_to_lru_list(page, lruvec);
 		__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
 	}
 out:
-	unlock_page_lruvec_irq(lruvec);
-	unlock_page_memcg(page);
+	if (isolated)
+		SetPageLRU(page);
+	return lruvec;
+}
+
+/*
+ * Flags held in the low bits of a struct page pointer on the mlock_pvec.
+ */
+#define LRU_PAGE 0x1
+#define NEW_PAGE 0x2
+#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
+#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))
+
+/*
+ * mlock_pagevec() is derived from pagevec_lru_move_fn():
+ * perhaps that can make use of such page pointer flags in future,
+ * but for now just keep it for mlock.  We could use three separate
+ * pagevecs instead, but one feels better (munlocking a full pagevec
+ * does not need to drain mlocking pagevecs first).
+ */
+static void mlock_pagevec(struct pagevec *pvec)
+{
+	struct lruvec *lruvec = NULL;
+	unsigned long mlock;
+	struct page *page;
+	int i;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		mlock = (unsigned long)page & (LRU_PAGE | NEW_PAGE);
+		page = (struct page *)((unsigned long)page - mlock);
+		pvec->pages[i] = page;
+
+		if (mlock & LRU_PAGE)
+			lruvec = __mlock_page(page, lruvec);
+		else if (mlock & NEW_PAGE)
+			lruvec = __mlock_new_page(page, lruvec);
+		else
+			lruvec = __munlock_page(page, lruvec);
+	}
+
+	if (lruvec)
+		unlock_page_lruvec_irq(lruvec);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
+}
+
+void mlock_page_drain(int cpu)
+{
+	struct pagevec *pvec;
+
+	pvec = &per_cpu(mlock_pvec, cpu);
+	if (pagevec_count(pvec))
+		mlock_pagevec(pvec);
+}
+
+bool need_mlock_page_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(mlock_pvec, cpu));
+}
+
+/**
+ * mlock_page - mlock a page already on (or temporarily off) LRU
+ * @page - page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+	if (!TestSetPageMlocked(page)) {
+		int nr_pages = thp_nr_pages(page);
+
+		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	}
+
+	get_page(page);
+	if (!pagevec_add(pvec, mlock_lru(page)) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
+}
+
+/**
+ * mlock_new_page - mlock a newly allocated page not yet on LRU
+ * @page - page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_new_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+	int nr_pages = thp_nr_pages(page);
+
+	SetPageMlocked(page);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+	__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+
+	get_page(page);
+	if (!pagevec_add(pvec, mlock_new(page)) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
+}
+
+/**
+ * munlock_page - munlock a page
+ * @page - page to be munlocked, either a normal page or a THP head.
+ */
+void munlock_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+	/*
+	 * TestClearPageMlocked(page) must be left to __munlock_page(),
+	 * which will check whether the page is multiply mlocked.
+	 */
+
+	get_page(page);
+	if (!pagevec_add(pvec, page) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
 }
 
 static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
diff --git a/mm/swap.c b/mm/swap.c
index 3f770b1ea2c1..842d5cd92cf6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -490,18 +490,12 @@ EXPORT_SYMBOL(folio_add_lru);
 void lru_cache_add_inactive_or_unevictable(struct page *page,
 					 struct vm_area_struct *vma)
 {
-	bool unevictable;
-
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
-	if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
-		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
-	}
-	lru_cache_add(page);
+	if (unlikely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED))
+		mlock_new_page(page);
+	else
+		lru_cache_add(page);
 }
 
 /*
@@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu)
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
+	mlock_page_drain(cpu);
 }
 
 /**
@@ -842,6 +837,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
 		    need_activate_page_drain(cpu) ||
+		    need_mlock_page_drain(cpu) ||
 		    has_bh_in_lru(cpu, NULL)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
@@ -1030,7 +1026,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	 * Is an smp_mb__after_atomic() still required here, before
 	 * folio_evictable() tests PageMlocked, to rule out the possibility
 	 * of stranding an evictable folio on an unevictable LRU?  I think
-	 * not, because munlock_page() only clears PageMlocked while the LRU
+	 * not, because __munlock_page() only clears PageMlocked while the LRU
 	 * lock is held.
 	 *
 	 * (That is not true of __page_cache_release(), and not necessarily
@@ -1043,7 +1039,14 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	} else {
 		folio_clear_active(folio);
 		folio_set_unevictable(folio);
-		folio->mlock_count = !!folio_test_mlocked(folio);
+		/*
+		 * folio->mlock_count = !!folio_test_mlocked(folio)?
+		 * But that leaves __mlock_page() in doubt whether another
+		 * actor has already counted the mlock or not.  Err on the
+		 * safe side, underestimate, let page reclaim fix it, rather
+		 * than leaving a page on the unevictable LRU indefinitely.
+		 */
+		folio->mlock_count = 0;
 		if (!was_unevictable)
 			__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (10 preceding siblings ...)
  2022-02-06 21:47 ` [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec Hugh Dickins
@ 2022-02-06 21:49 ` Hugh Dickins
  2022-02-11 18:49   ` Vlastimil Babka
  2022-02-06 21:51 ` [PATCH 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH) Hugh Dickins
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So try_to_migrate_one() drains the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.  If the mlock pagevec raises doubts, we can easily add a
sysctl to tune its length to 1, which reverts to synchronous operation.
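
To make "fails the freeze" concrete, a minimal sketch (an invented helper,
not kernel code) of the refcount expectation which the stranded pagevec
reference breaks:

/*
 * Sketch only: can_freeze_for_migration() is invented for illustration.
 * Migration may replace a page only when it can freeze the refcount at
 * the expected value.  A reference held by the per-cpu mlock pagevec
 * leaves page_count() one higher than expected, so the freeze fails and
 * the VM_LOCKED page cannot migrate until that pagevec is drained.
 */
static bool can_freeze_for_migration(struct page *page, int expected_refs)
{
	/* expected_refs: the mappings plus migration's own reference */
	return page_ref_freeze(page, expected_refs);
}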

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/migrate.c | 2 ++
 mm/rmap.c    | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 
@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (11 preceding siblings ...)
  2022-02-06 21:49 ` [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained Hugh Dickins
@ 2022-02-06 21:51 ` Hugh Dickins
  2022-02-06 21:53 ` [PATCH 13/13] mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP Hugh Dickins
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

collapse_file() is using unmap_mapping_pages(1) on each small page found
mapped, unlike others (reclaim, migration, splitting, memory-failure) who
use try_to_unmap().  There are four advantages to try_to_unmap(): first,
its TTU_IGNORE_MLOCK option now avoids leaving mlocked page in pagevec;
second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
third, it breaks out early if page is not mapped everywhere it might be;
fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
save up all the TLB flushing until all of the pages have been unmapped.
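
As a simplified sketch of that fourth point (unmap_batch() is an invented
helper, not the collapse_file() code below): each try_to_unmap() call with
TTU_BATCH_FLUSH defers its TLB flush, and a single try_to_unmap_flush()
then covers the whole batch:

static void unmap_batch(struct page **pages, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		if (page_mapped(pages[i]))
			try_to_unmap(pages[i],
				     TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH);

	try_to_unmap_flush();	/* one TLB flush for the whole batch */
}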

Wild guess: perhaps it was originally written to use try_to_unmap(),
but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never
skips them, so switching to it fixed the issue.  I did once hit that
VM_BUG_ON_PAGE() since making this change: we could pass TTU_SYNC here,
but I think it better just to delete the check - the race is very rare,
this is an ordinary small page
so we don't need to be so paranoid about mapcount surprises, and the
page_ref_freeze() just below already handles the case adequately.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d5e387c58bde..e0883a33efd6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1829,13 +1829,12 @@ static void collapse_file(struct mm_struct *mm,
 		}
 
 		if (page_mapped(page))
-			unmap_mapping_pages(mapping, index, 1, false);
+			try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH);
 
 		xas_lock_irq(&xas);
 		xas_set(&xas, index);
 
 		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
-		VM_BUG_ON_PAGE(page_mapped(page), page);
 
 		/*
 		 * The page is expected to have page_count() == 3:
@@ -1899,6 +1898,13 @@ static void collapse_file(struct mm_struct *mm,
 	xas_unlock_irq(&xas);
 xa_unlocked:
 
+	/*
+	 * If collapse is successful, flush must be done now before copying.
+	 * If collapse is unsuccessful, does flush actually need to be done?
+	 * Do it anyway, to clear the state.
+	 */
+	try_to_unmap_flush();
+
 	if (result == SCAN_SUCCEED) {
 		struct page *page, *tmp;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 13/13] mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (12 preceding siblings ...)
  2022-02-06 21:51 ` [PATCH 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH) Hugh Dickins
@ 2022-02-06 21:53 ` Hugh Dickins
       [not found] ` <20220207033518.2602-1-hdanton@sina.com>
  2022-02-09 15:35 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Michal Hocko
  15 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-06 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

4.8 commit 7751b2da6be0 ("vmscan: split file huge pages before paging
them out") inserted a split_huge_page_to_list() into shrink_page_list()
without considering the mlock case: no problem if the page has already
been marked as Mlocked (the !page_evictable check much higher up will
have skipped all this), but it has always been the case that races or
omissions in setting Mlocked can rely on page reclaim to detect this
and correct it before actually reclaiming - and that remains so, but
what a shame if a hugepage is needlessly split before discovering it.

It is surprising that page_check_references() returns PAGEREF_RECLAIM
when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
is where the condition is detected and corrected; and until now it could
not be done in page_referenced_one(), because that does not always have
the page locked.  Now that mlock's requirement for page lock has gone,
copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
and let page_check_references() return PAGEREF_ACTIVATE in this case.

But page_referenced_one() may find a pte mapping one part of a hugepage:
what hold should a pte mapped in a VM_LOCKED area exert over the entire
huge page?  That's debatable.  The approach taken here is to treat that
pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
VM_LOCKED pmd mapping is found later in the walk, and lack of reference
permits, then PAGEREF_RECLAIM takes it to attempted splitting as before.
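
Restating that rule as a sketch (an invented helper; the patch below
open-codes the test in page_referenced_one()):

/*
 * Sketch only: a small page, or a pmd-mapped THP, in a VM_LOCKED vma
 * gets its missed mlock restored; a pte mapping of part of a THP is
 * treated as if not VM_LOCKED.
 */
static bool vm_locked_holds_page(struct page *page,
				 struct vm_area_struct *vma, pte_t *pte)
{
	if (!(vma->vm_flags & VM_LOCKED))
		return false;
	return !PageTransCompound(page) || !pte;
}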

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/rmap.c   | 7 +++++--
 mm/vmscan.c | 6 +++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 714bfdc72c7b..c7921c102bc0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -812,7 +812,10 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		if ((vma->vm_flags & VM_LOCKED) &&
+		    (!PageTransCompound(page) || !pvmw.pte)) {
+			/* Restore the mlock which got missed */
+			mlock_vma_page(page, vma, !pvmw.pte);
 			page_vma_mapped_walk_done(&pvmw);
 			pra->vm_flags |= VM_LOCKED;
 			return false; /* To break the loop */
@@ -851,7 +854,7 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 
 	if (referenced) {
 		pra->referenced++;
-		pra->vm_flags |= vma->vm_flags;
+		pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
 	}
 
 	if (!pra->mapcount)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..a160efba3c73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1386,11 +1386,11 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/*
-	 * Mlock lost the isolation race with us.  Let try_to_unmap()
-	 * move the page to the unevictable list.
+	 * The supposedly reclaimable page was found to be in a VM_LOCKED vma.
+	 * Let the page, now marked Mlocked, be moved to the unevictable list.
 	 */
 	if (vm_flags & VM_LOCKED)
-		return PAGEREF_RECLAIM;
+		return PAGEREF_ACTIVATE;
 
 	if (referenced_ptes) {
 		/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-06 21:38 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Matthew Wilcox
@ 2022-02-07 18:20   ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-07 18:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Vlastimil Babka,
	Kirill A. Shutemov, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Sun, 6 Feb 2022, Matthew Wilcox wrote:
> On Sun, Feb 06, 2022 at 01:27:41PM -0800, Hugh Dickins wrote:
> > Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> > mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> > likely to conflict (I hope not fundamentally) with several concurrent
> > large patchsets.
> 
> Most of this patchset hasn't arrived here yet, but I would be
> _delighted_ to rebase the folio conversion on top of this.  What
> convoluted code it is!  I am so glad you've cleaned this up; I was
> dreading doing the rest of the mlock file.

That's a very generous offer: thank you.  I'd been looking at it
the other way round, afraid that it would be getting in your way:
but now think you're right, that the cleanup there will help your
work.

I had found mm/mlock.c unexpectedly difficult to deal with when
doing shmem huge pages, and IIRC Kirill had the same experience.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking
       [not found] ` <20220207033518.2602-1-hdanton@sina.com>
@ 2022-02-07 18:46   ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-07 18:46 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Mon, 7 Feb 2022, Hillf Danton wrote:
> On Sun, 6 Feb 2022 13:42:09 -0800 (PST) Hugh Dickins wrote:
> > +static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > +	unsigned long start, unsigned long end, vm_flags_t newflags)
> >  {
> > -	/* Reimplementation to follow in later commit */
> > +	static const struct mm_walk_ops mlock_walk_ops = {
> > +		.pmd_entry = mlock_pte_range,
> > +	};
> > +
> > +	/*
> > +	 * There is a slight chance that concurrent page migration,
> > +	 * or page reclaim finding a page of this now-VM_LOCKED vma,
> > +	 * will call mlock_vma_page() and raise page's mlock_count:
> > +	 * double counting, leaving the page unevictable indefinitely.
> > +	 * Communicate this danger to mlock_vma_page() with VM_IO,
> > +	 * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
> > +	 * mmap_lock is held in write mode here, so this weird
> > +	 * combination should not be visible to others.
> > +	 */
> > +	if (newflags & VM_LOCKED)
> > +		newflags |= VM_IO;
> > +	WRITE_ONCE(vma->vm_flags, newflags);
> 
> Nit
> 
> The WRITE_ONCE is not needed, given the certainty of invisibility to
> others - it will quiesce syzbot reporting the case of visibility.

Ah, maybe I can rewrite that comment better: when I said "visible to
others", I meant visible to "the outside world", those participating in
the usual mmap_lock'ed access, syscalls and /proc/pid/maps and smaps etc.

The point here is that some kernel low-level internals (page migration
and page reclaim) peek at vma->vm_flags without mmap_lock (but with
anon_vma lock or i_mmap_rwsem).

Originally I had VM_LOCKED set in vma->vm_flags before calling
mlock_vma_pages_range(), no need for a newflags parameter.  Then
realized that left a tiny window in which VM_LOCKED was visible to
migration and reclaim without the safening VM_IO, so changed it to pass
in newflags, then "newflags |= VM_IO", then "vma->vm_flags = newflags"
there.   Then realized that perhaps an uncooperative compiler might be
inspired to mutate that into "vma->vm_flags = newflags" followed by
"vma->vm_flags |= VM_IO".  I hope it would not, but can I be sure
that it would not?  That's why I ended up with WRITE_ONCE() there.

Maybe all rather overkill: but trying to ensure that we undercount
mlock_count rather than risk overcounting it.
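
To sketch what I mean (invented helper name, not the patch itself):

/*
 * Without WRITE_ONCE(), a plain assignment could in principle be torn
 * into two stores, exposing VM_LOCKED without the protective VM_IO to
 * lockless readers of vma->vm_flags (page reclaim, page migration).
 */
static void set_vma_locked_flags(struct vm_area_struct *vma,
				 vm_flags_t newflags)
{
	if (newflags & VM_LOCKED)
		newflags |= VM_IO;
	/* one untearable store of the combined flags */
	WRITE_ONCE(vma->vm_flags, newflags);
}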

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec
  2022-02-06 21:47 ` [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec Hugh Dickins
@ 2022-02-09  8:15   ` Geert Uytterhoeven
  2022-02-09 15:45     ` Hugh Dickins
  2022-02-11 18:26   ` Vlastimil Babka
  1 sibling, 1 reply; 49+ messages in thread
From: Geert Uytterhoeven @ 2022-02-09  8:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Linux MM

Hi Hugh,

On Wed, Feb 9, 2022 at 3:46 AM Hugh Dickins <hughd@google.com> wrote:
> A weakness of the page->mlock_count approach is the need for lruvec lock
> while holding page table lock.  That is not an overhead we would allow on
> normal pages, but I think acceptable just for pages in an mlocked area.
> But let's try to amortize the extra cost by gathering on per-cpu pagevec
> before acquiring the lruvec lock.
>
> I have an unverified conjecture that the mlock pagevec might work out
> well for delaying the mlock processing of new file pages until they have
> got off lru_cache_add()'s pagevec and on to LRU.
>
> The initialization of page->mlock_count is subject to races and awkward:
> 0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
> this commit, which just widens the window?  I haven't gone back to think
> it through.  Maybe someone can point out a better way to initialize it.
>
> Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
> into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
> rather than lru_cache_add()'s pagevec.
>
> Experimented with various orderings: the right thing seems to be for
> mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
> pagevec, but munlock_page() to leave TestClearPageMlocked to the later
> pagevec processing.
>
> Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
> their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS()
> for that.
>
> This still leaves acquiring lruvec locks under page table lock each time
> the pagevec fills (or a THP is added): which I suppose is rather silly,
> since they sit on pagevec waiting to be processed long after page table
> lock has been dropped; but I'm disinclined to uglify the calling sequence
> until some load shows an actual problem with it (nothing wrong with
> taking lruvec lock under page table lock, just "nicer" to do it less).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Thanks for your patch, which is now commit cbaf47432c909044
("mm/munlock: mlock_page() munlock_page() batch by pagevec") in
next-20220209.

> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>   *
>   * mlock is usually called at the end of page_add_*_rmap(),
>   * munlock at the end of page_remove_rmap(); but new anon
> - * pages are managed in lru_cache_add_inactive_or_unevictable().
> + * pages are managed by lru_cache_add_inactive_or_unevictable()
> + * calling mlock_new_page().
>   *
>   * @compound is used to include pmd mappings of THPs, but filter out
>   * pte mappings of THPs, which cannot be consistently counted: a pte
> @@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page,
>             (compound || !PageTransCompound(page)))
>                 munlock_page(page);
>  }
> +void mlock_new_page(struct page *page);
> +bool need_mlock_page_drain(int cpu);
> +void mlock_page_drain(int cpu);

This is inside an #ifdef CONFIG_MMU section.

> --- a/mm/swap.c
> +++ b/mm/swap.c

> @@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu)
>                 pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
>
>         activate_page_drain(cpu);
> +       mlock_page_drain(cpu);

noreply@ellerman.id.au reported for m5272c3_defconfig:

    mm/swap.c:637:2: error: implicit declaration of function
‘mlock_page_drain’ [-Werror=implicit-function-declaration]

http://kisskb.ellerman.id.au/kisskb/buildresult/14694567/

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
                   ` (14 preceding siblings ...)
       [not found] ` <20220207033518.2602-1-hdanton@sina.com>
@ 2022-02-09 15:35 ` Michal Hocko
  2022-02-09 16:21   ` Hugh Dickins
  15 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2022-02-09 15:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Sun 06-02-22 13:27:41, Hugh Dickins wrote:
> I wondered whether to post this munlocking rework in
> https://lore.kernel.org/linux-mm/35c340a6-96f-28a0-2b7b-2f9fbddc01f@google.com/
> 
> There the discussion was OOM reaping, but the main reason for the rework
> has been catastrophic contention on i_mmap_rwsem when exiting from
> multiply mlocked files; and frustration with how contorted munlocking is.

yeah, I do not think that too many care about the problem of the oom_reaper
stumbling over vast portions of mlocked memory. So far I have seen only LTP
oom02 hitting this, and that one does mlockall and intentionally hits OOM.
So far I haven't seen any actual workload hitting something similar.
But hey, if we get better and easier-to-maintain mlock code, then the
oom_reaper will surely add its huge thanks.

> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.
> 
> tl;dr
>  mm/mlock.c  |  634 +++++++++++++++-----------------------
>  23 files changed, 510 insertions(+), 779 deletions(-)

This is really impressive!

> In my own opinion, this is ready to go in: that whatever its defects,
> it's a big practical improvement over what's presently there.  Others
> may differ; and it may be too much to arrive at a quick opinion on.
> My hope is that it will strike a chord with some who have laboured in
> this area in the past: I'll be short of time to do much more with it,
> so maybe someone else could take over if necessary.
> 
> At present there's no update to Documentation/vm/unevictable-lru.rst:
> that always needs a different mindset, can follow later, please refer
> to commit messages for now.

Yes, that would help for sure.

So far I have only managed to read through the series and to try to put
all the pieces together (so far I have given up on the THP part), and my
understanding is far from complete. But I have to say I like the general
approach and overall simplification.

The only thing that is not entirely clear to me at the moment is why you
have chosen to ignore already mapped LOCKONFAULT pages. They will
eventually get sorted out during the reclaim/migration but this can
backfire if too many pages have been pre-faulted before LOCKONFAULT
call. Maybe not an interesting case in the first place but I am still
wondering why you have chosen that way.

I will be off next couple of days and plan to revisit this afterwards
(should time allow). Anyway thanks a lot Hugh!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec
  2022-02-09  8:15   ` Geert Uytterhoeven
@ 2022-02-09 15:45     ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-09 15:45 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Vlastimil Babka,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Alistair Popple, Johannes Weiner, Rik van Riel,
	Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt,
	SeongJae Park, Linux Kernel Mailing List, Linux MM


On Wed, 9 Feb 2022, Geert Uytterhoeven wrote:
> 
> Thanks for your patch, which is now commit cbaf47432c909044
> ("mm/munlock: mlock_page() munlock_page() batch by pagevec") in
> next-20220209.
> 
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> >   *
> >   * mlock is usually called at the end of page_add_*_rmap(),
> >   * munlock at the end of page_remove_rmap(); but new anon
> > - * pages are managed in lru_cache_add_inactive_or_unevictable().
> > + * pages are managed by lru_cache_add_inactive_or_unevictable()
> > + * calling mlock_new_page().
> >   *
> >   * @compound is used to include pmd mappings of THPs, but filter out
> >   * pte mappings of THPs, which cannot be consistently counted: a pte
> > @@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page,
> >             (compound || !PageTransCompound(page)))
> >                 munlock_page(page);
> >  }
> > +void mlock_new_page(struct page *page);
> > +bool need_mlock_page_drain(int cpu);
> > +void mlock_page_drain(int cpu);
> 
> This is inside an #ifdef CONFIG_MMU section.
> 
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> 
> > @@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu)
> >                 pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
> >
> >         activate_page_drain(cpu);
> > +       mlock_page_drain(cpu);
> 
> noreply@ellerman.id.au reported for m5272c3_defconfig:
> 
>     mm/swap.c:637:2: error: implicit declaration of function
> ‘mlock_page_drain’ [-Werror=implicit-function-declaration]
> 
> http://kisskb.ellerman.id.au/kisskb/buildresult/14694567/

Hi Geert, thanks for your report: SeongJae has posted a fix

https://lore.kernel.org/lkml/20220209094158.21941-2-sj@kernel.org/

but there I'm wondering if it's complete: I'll settle it later today.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-09 15:35 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Michal Hocko
@ 2022-02-09 16:21   ` Hugh Dickins
  2022-02-09 21:01     ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-09 16:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Wed, 9 Feb 2022, Michal Hocko wrote:
> 
> So far I have only managed to read through the series and to try to put
> all the pieces together (so far I have given up on the THP part), and my
> understanding is far from complete. But I have to say I like the general
> approach and overall simplification.

Many thanks for looking, Michal, and for all the positivity!

> 
> The only thing that is not entirely clear to me at the moment is why you
> have chosen to ignore already mapped LOCKONFAULT pages. They will
> eventually get sorted out during the reclaim/migration but this can
> backfire if too many pages have been pre-faulted before LOCKONFAULT
> call. Maybe not an interesting case in the first place but I am still
> wondering why you have chosen that way.

I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
pages?  I'd consider that a bug.

It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
set too?  If I've got that wrong, yes, I'll need to revisit conditions.

> 
> I will be off next couple of days and plan to revisit this afterwards
> (should time allow). Anyway thanks a lot Hugh!
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-06 21:30 ` [PATCH 01/13] mm/munlock: delete page_mlock() and all its works Hugh Dickins
@ 2022-02-09 18:31   ` Vlastimil Babka
  2022-02-09 22:28     ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-09 18:31 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:30, Hugh Dickins wrote:
> We have recommended some applications to mlock their userspace, but that
> turns out to be counter-productive: when many processes mlock the same
> file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
> is needed for write, to remove any vma mapping that file from rmap's tree;
> but hogged for read by those with mlocks calling page_mlock() (formerly
> known as try_to_munlock()) on *each* page mapped from the file (the
> purpose being to find out whether another process has the page mlocked,
> so therefore it should not be unmlocked yet).
> 
> Several optimizations have been made in the past: one is to skip
> page_mlock() when mapcount tells that nothing else has this page
> mapped; but that doesn't help at all when others do have it mapped.
> This time around, I initially intended to add a preliminary search
> of the rmap tree for overlapping VM_LOCKED ranges; but that gets
> messy with locking order, when in doubt whether a page is actually
> present; and risks adding even more contention on the i_mmap_rwsem.
> 
> A solution would be much easier, if only there were space in struct page
> for an mlock_count... but actually, most of the time, there is space for
> it - an mlocked page spends most of its life on an unevictable LRU, but
> since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
> been redundant.  Let's try to reuse its page->lru.
> 
> But leave that until a later patch: in this patch, clear the ground by
> removing page_mlock(), and all the infrastructure that has gathered
> around it - which mostly hinders understanding, and will make reviewing
> new additions harder.  Don't mind those old comments about THPs, they
> date from before 4.5's refcounting rework: splitting is not a risk here.
> 
> Just keep a minimal version of munlock_vma_page(), as reminder of what it
> should attend to (in particular, the odd way PGSTRANDED is counted out of
> PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
> unchanged __mlock_posix_error_return() out of the way, down to above its
> caller: this series then makes no further change after mlock_fixup().
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

While I understand the reasons to clear the ground first, I wonder what the
implications are for bisectability - is there a risk of surprising failures?
Maybe we should at least explicitly spell out the implications here?
IIUC, pages that once become mlocked will stay mlocked, inflating the
Mlocked meminfo counter and remaining unreclaimable. But if e.g. a process
that did mlockall() exits, its exclusive pages will be freed anyway, so it's
not a catastrophic kind of leak, right?
Yet it differs from the existing "failure modes" where pages would be left
as "stranded" due to failure of being isolated, because they would at least
go through TestClearPageMlocked and counters update.

>  
>  /*
> @@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>   *
>   * Returns with VM_LOCKED cleared.  Callers must be prepared to
>   * deal with this.
> - *
> - * We don't save and restore VM_LOCKED here because pages are
> - * still on lru.  In unmap path, pages might be scanned by reclaim
> - * and re-mlocked by page_mlock/try_to_unmap before we unmap and
> - * free them.  This will result in freeing mlocked pages.
>   */
>  void munlock_vma_pages_range(struct vm_area_struct *vma,
>  			     unsigned long start, unsigned long end)
>  {
> -	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

Should we at least keep doing the flags clearing? I haven't check if there
are some VM_BUG_ONs that would trip on not cleared, but wouldn't be entirely
surprised.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-09 16:21   ` Hugh Dickins
@ 2022-02-09 21:01     ` Michal Hocko
  2022-02-09 22:59       ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2022-02-09 21:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> On Wed, 9 Feb 2022, Michal Hocko wrote:
[...]
> > The only thing that is not entirely clear to me at the moment is why you
> > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > eventually get sorted out during the reclaim/migration but this can
> > backfire if too many pages have been pre-faulted before LOCKONFAULT
> > call. Maybe not an interesting case in the first place but I am still
> > wondering why you have chosen that way.
> 
> I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> pages?  I'd consider that a bug.

I've had the following path in mind
__mm_populate
  populate_vma_page_range
    if (vma->vm_flags & VM_LOCKONFAULT)
    	return nr_pages

which means that __get_user_pages is not called at all. This also means
that follow_page_mask is skipped. The page table walk used to mark
already mapped pages as mlocked so unless I miss some other path it
would stay on its original LRU list and only get real mlock protection
when encountered by the reclaim or migration.
 
> It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> set too?  If I've got that wrong, yes, I'll need to revisit conditions.

Yes, I didn't really mean we would lose the mlock protection. We would
just do the lazy mlock also on pages which are already mapped. This is
certainly not a correctness issue - although stats might be off - but it
could pollute existing LRUs with mlocked pages rather easily.

As I've said, either I am still missing something, or this might not even
be a big deal in real life. I was mostly curious about the choice to
exclude the page table walk for LOCKONFAULT.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-09 18:31   ` Vlastimil Babka
@ 2022-02-09 22:28     ` Hugh Dickins
  2022-02-10  9:52       ` Vlastimil Babka
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-09 22:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Wed, 9 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:30, Hugh Dickins wrote:
> 
> While I understand the reasons to clear the ground first, I wonder what the
> implications are for bisectability - is there a risk of surprising failures?
> Maybe we should at least explicitly spell out the implications here?
> IIUC, pages that once become mlocked will stay mlocked, inflating the
> Mlocked meminfo counter and remaining unreclaimable. But if e.g. a process
> that did mlockall() exits, its exclusive pages will be freed anyway, so it's
> not a catastrophic kind of leak, right?

Thanks for taking a look, Vlastimil.  You make a good point here.

I had satisfied myself that no stage of the series was going to introduce
boot failures or BUGs; and if someone is bisecting some mlock/munlock
misbehaviour, I would not worry about which commit of the series they
alight on, but root cause it keeping all the patches in mind.

But we certainly wouldn't want the series split up into separately
submitted parts (that is, split anywhere between 01/13 and 07/13:
splitting the rest apart wouldn't matter much); and it would be
unfortunate if someone were bisecting some reclaim failure OOM problem
elsewhere, and their test app happened to be using mlock, and their
bisection landed part way between 01 and 07 here - the unimplemented
munlock confusing the bisection for OOM.

The worst of it would be, I think, landing between 05 and 07: where
your mlockall could mlock a variety of shared libraries etc, moving
all their pages to unevictable, and those pagecache pages not getting
moved back to evictable when unmapped.  I forget the current shrinker
situation, whether inode pressure could evict that pagecache or not.

Two mitigations come to mind, let me think on it some more (and hope
nobody's bisecting OOMs meanwhile): one might be to shift 05 (the one
which replaces clear_page_mlock() on last unmap by clearance when
freeing) later in the series - its position was always somewhat
arbitrary, but that position is where it's been tested; another might
be to put nothing at all on the unevictable list in between 01 and 07.

Though taking this apart and putting it back together brings its own
dangers.  That second suggestion probably won't fly very well, with
06/13 using mlock_count only while on the unevictable.  I'm not sure
how much rethinking the bisection possibility deserves.

> Yet it differs from the existing "failure modes" where pages would be left
> as "stranded" due to failure of being isolated, because they would at least
> go through TestClearPageMlocked and counters update.
> 
> >  
> >  /*
> > @@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> >   *
> >   * Returns with VM_LOCKED cleared.  Callers must be prepared to
> >   * deal with this.
> > - *
> > - * We don't save and restore VM_LOCKED here because pages are
> > - * still on lru.  In unmap path, pages might be scanned by reclaim
> > - * and re-mlocked by page_mlock/try_to_unmap before we unmap and
> > - * free them.  This will result in freeing mlocked pages.
> >   */
> >  void munlock_vma_pages_range(struct vm_area_struct *vma,
> >  			     unsigned long start, unsigned long end)
> >  {
> > -	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> 
> Should we at least keep doing the flags clearing? I haven't check if there
> are some VM_BUG_ONs that would trip on not cleared, but wouldn't be entirely
> surprised.

There are two flags in question here, VM_LOCKED and VM_LOCKONFAULT:
I'm not sure which of them you're particularly concerned about.

As to VM_LOCKED: yes, you're right, at this stage of the series the
munlock really ought to be clearing VM_LOCKED (even while it doesn't
go on to do anything about the pages), as it claims in the comment above.
I removed this line at a later stage (07/13), when changing it to
mlock_vma_pages_range() serving both mlock and munlock according to
whether VM_LOCKED is provided - and mistakenly folded back that deletion
to this patch.  End result the same, but better to restore that maskout
in this patch, as you suggest.

As to VM_LOCKONFAULT: I had checked the rest of mm/mlock.c, and the
rest of the tree, and it only ever reached here along with VM_LOCKED;
so when in 07/13 I switched over to "vma->vm_flags = newflags" (or
WRITE_ONCE equivalent), I just didn't see the need to mask it out in
the munlocking case; but could add a VM_BUG_ON that newflags never
has it without VM_LOCKED, if you like. 

(You'll say VM_WARN_ON_ONCE, I'll say VM_BUG_ON because it never happens,
then as soon as I put it in and run LTP or kselftests, I'll be ashamed
to discover I've got it wrong, perhaps.)
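
For concreteness, such an assertion might look like this sketch (not
something the series currently adds), placed where newflags is computed
in mlock_fixup():

	/* sketch only: LOCKONFAULT should never arrive without LOCKED */
	VM_BUG_ON_VMA((newflags & VM_LOCKONFAULT) && !(newflags & VM_LOCKED),
		      vma);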

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-09 21:01     ` Michal Hocko
@ 2022-02-09 22:59       ` Hugh Dickins
  2022-02-10  7:49         ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-09 22:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Wed, 9 Feb 2022, Michal Hocko wrote:
> On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> > On Wed, 9 Feb 2022, Michal Hocko wrote:
> [...]
> > > The only thing that is not entirely clear to me at the moment is why you
> > > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > > eventually get sorted out during the reclaim/migration but this can
> > > backfire if too many pages have been pre-faulted before LOCKONFAULT
> > > call. Maybe not an interesting case in the first place but I am still
> > > wondering why you have chosen that way.
> > 
> > I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> > pages?  I'd consider that a bug.
> 
> I've had the following path in mind
> __mm_populate
>   populate_vma_page_range
>     if (vma->vm_flags & VM_LOCKONFAULT)
>     	return nr_pages
> 
> which means that __get_user_pages is not called at all. This also means
> that follow_page_mask is skipped. The page table walk used to mark
> already mapped pages as mlocked so unless I miss some other path it
> would stay on its original LRU list and only get real mlock protection
> when encountered by the reclaim or migration.

That is so until 04/13: but then, whenever a page is faulted into a
VM_LOCKED area (except when skipping pte-of-compound) it goes through
mlock_page().  So it will then lock-on-fault.

Ah, but you're worrying about any previously-mapped pages when
VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
in 07/13, same as when VM_LOCKED is established - I checked again,
VM_LOCKONFAULT is not set without VM_LOCKED.

>  
> > It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> > set too?  If I've got that wrong, yes, I'll need to revisit conditions.
> 
> Yes, I didn't really mean we would lose the mlock protection. We would
> just do the lazy mlock also on pages which are already mapped. This is
> certainly not a correctness issue - although stats might be off - but it
> could pollute existing LRUs with mlocked pages rather easily.

Even with the lazy mlock in reclaim, I'd still accuse myself of a bug
if I've got this wrong.

> 
> As I've said, either I am still missing something, or this might not even
> be a big deal in real life. I was mostly curious about the choice to
> exclude the page table walk for LOCKONFAULT.

It surprised me too: but looking at how FOLL_POPULATE was handled in
faultin_page(), it appeared to me to become redundant, once mlocking
was counted at fault time (and in mlock_pte_range() - maybe that's
the page table walk you're looking for, not excluded but moved).
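
To sketch the kind of per-pte handling I mean (a simplified, invented
helper: the real mlock_pte_range() in 07/13 uses the pagewalk API and
also handles the pmd-mapped THP case):

static void mlock_one_pte(struct vm_area_struct *vma, unsigned long addr,
			  pte_t *pte)
{
	struct page *page;

	if (!pte_present(*pte))
		return;
	page = vm_normal_page(vma, addr, *pte);
	/* count each already-mapped small page at mlock time */
	if (page && !PageTransCompound(page))
		mlock_vma_page(page, vma, /* compound */ false);
}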

It is a big deal for me: please don't let me fool you, please do
continue to worry at it until you're convinced or you convince me.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling
  2022-02-09 22:59       ` Hugh Dickins
@ 2022-02-10  7:49         ` Michal Hocko
  0 siblings, 0 replies; 49+ messages in thread
From: Michal Hocko @ 2022-02-10  7:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Wed 09-02-22 14:59:19, Hugh Dickins wrote:
[...]
> Ah, but you're worrying about any previously-mapped pages when
> VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
> in 07/13, same as when VM_LOCKED is established - I checked again,
> VM_LOCKONFAULT is not set without VM_LOCKED.

Right you are. mlock_fixup->mlock_vma_pages_range is the missing piece.
This is called for both do_mlock and mlockall paths. So all good and
thanks for the clarification!

I will get back to this sometimes next week.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-09 22:28     ` Hugh Dickins
@ 2022-02-10  9:52       ` Vlastimil Babka
  2022-02-14  6:52         ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-10  9:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/9/22 23:28, Hugh Dickins wrote:
> On Wed, 9 Feb 2022, Vlastimil Babka wrote:
>> On 2/6/22 22:30, Hugh Dickins wrote:
> Thanks for taking a look, Vlastimil.  You make a good point here.
> 
> I had satisfied myself that no stage of the series was going to introduce
> boot failures or BUGs; and if someone is bisecting some mlock/munlock
> misbehaviour, I would not worry about which commit of the series they
> alight on, but root cause it keeping all the patches in mind.
> 
> But we certainly wouldn't want the series split up into separately
> submitted parts (that is, split anywhere between 01/13 and 07/13:
> splitting the rest apart wouldn't matter much); and it would be
> unfortunate if someone were bisecting some reclaim failure OOM problem
> elsewhere, and their test app happened to be using mlock, and their
> bisection landed part way between 01 and 07 here - the unimplemented
> munlock confusing the bisection for OOM.
> 
> The worst of it would be, I think, landing between 05 and 07: where
> your mlockall could mlock a variety of shared libraries etc, moving
> all their pages to unevictable, and those pagecache pages not getting
> moved back to evictable when unmapped.  I forget the current shrinker
> situation, whether inode pressure could evict that pagecache or not.
> 
> Two mitigations come to mind, let me think on it some more (and hope
> nobody's bisecting OOMs meanwhile): one might be to shift 05 (the one
> which replaces clear_page_mlock() on last unmap by clearance when
> freeing) later in the series - its position was always somewhat
> arbitrary, but that position is where it's been tested; another might
> be to put nothing at all on the unevictable list in between 01 and 07.
> 
> Though taking this apart and putting it back together brings its own
> dangers.  That second suggestion probably won't fly very well, with
> 06/13 using mlock_count only while on the unevictable.  I'm not sure
> how much rethinking the bisection possibility deserves.

Right, if it's impractical to change this to guard against potential (and
hopefully unlikely) bad bisection luck, then just a note at the end of each
patch's changelog mentioning what temporarily doesn't work should be enough.

>> Yet it differs from the existing "failure modes" where pages would be left
>> as "stranded" due to failure of being isolated, because they would at least
>> go through TestClearPageMlocked and counters update.
>> 
>> >  
>> >  /*
>> > @@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>> >   *
>> >   * Returns with VM_LOCKED cleared.  Callers must be prepared to
>> >   * deal with this.
>> > - *
>> > - * We don't save and restore VM_LOCKED here because pages are
>> > - * still on lru.  In unmap path, pages might be scanned by reclaim
>> > - * and re-mlocked by page_mlock/try_to_unmap before we unmap and
>> > - * free them.  This will result in freeing mlocked pages.
>> >   */
>> >  void munlock_vma_pages_range(struct vm_area_struct *vma,
>> >  			     unsigned long start, unsigned long end)
>> >  {
>> > -	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
>> 
>> Should we at least keep doing the flags clearing? I haven't check if there
>> are some VM_BUG_ONs that would trip on not cleared, but wouldn't be entirely
>> surprised.
> 
> There are two flags in question here, VM_LOCKED and VM_LOCKONFAULT:
> I'm not sure which of them you're particularly concerned about.

Well, either of those, but I said I didn't dig for possible consequences, as
simply not removing the line above looked simpler and matched the comment.

> As to VM_LOCKED: yes, you're right, at this stage of the series the
> munlock really ought to be clearing VM_LOCKED (even while it doesn't
> go on to do anything about the pages), as it claims in the comment above.
> I removed this line at a later stage (07/13), when changing it to
> mlock_vma_pages_range() serving both mlock and munlock according to
> whether VM_LOCKED is provided - and mistakenly folded back that deletion
> to this patch.  End result the same, but better to restore that maskout
> in this patch, as you suggest.

Great, thanks. That restores any effect on VM_LOCKONFAULT in any case as well.

> As to VM_LOCKONFAULT: I had checked the rest of mm/mlock.c, and the
> rest of the tree, and it only ever reached here along with VM_LOCKED;
> so when in 07/13 I switched over to "vma->vm_flags = newflags" (or
> WRITE_ONCE equivalent), I just didn't see the need to mask it out in
> the munlocking case; but could add a VM_BUG_ON that newflags never
> has it without VM_LOCKED, if you like. 
> 
> (You'll say VM_WARN_ON_ONCE, I'll say VM_BUG_ON because it never happens,
> then as soon as I put it in and run LTP or kselftests, I'll be ashamed
> to discover I've got it wrong, perhaps.)

Wasn't suggesting new VM_BUG_ONs, just worried whether the patch was breaking
any existing ones.

> Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
  2022-02-06 21:32 ` [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE Hugh Dickins
@ 2022-02-10 11:35   ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-10 11:35 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:32, Hugh Dickins wrote:
> If counting page mlocks, we must not double-count: follow_page_pte() can
> tell if a page has already been Mlocked or not, but cannot tell if a pte
> has already been counted or not: that will have to be done when the pte
> is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
> for new anon pages, but there's no such tracking yet for others).
> 
> Delete all the FOLL_MLOCK code - faulting in the missing pages will do
> all that is necessary, without special mlock_vma_page() calls from here.
> 
> But then FOLL_POPULATE turns out to serve no purpose - it was there so
> that its absence would tell faultin_page() not to faultin page when
> setting up VM_LOCKONFAULT areas; but if there's no special work needed
> here for mlock, then there's no work at all here for VM_LOCKONFAULT.
> 
> Have I got that right?  I've not looked into the history, but see that
> FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
> purpose before?  Ah, yes, it was used to skip the old stack guard page.
> 
> And is it intentional that COW is not broken on existing pages when
> setting up a VM_LOCKONFAULT area?  I can see that being argued either
> way, and have no reason to disagree with current behaviour.

Yeah, I think it's consistent with the two use cases stated for VM_LOCKONFAULT:
https://lore.kernel.org/all/1441720742-7803-1-git-send-email-emunson@akamai.com/

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap
  2022-02-06 21:34 ` [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap Hugh Dickins
@ 2022-02-10 15:30   ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-10 15:30 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:34, Hugh Dickins wrote:
> munlock_vma_pages_range() will still be required, when munlocking but
> not munmapping a set of pages; but when unmapping a pte, the mlock count
> will be maintained in much the same way as it will be maintained when
> mapping in the pte.  Which removes the need for munlock_vma_pages_all()
> on mlocked vmas when munmapping or exiting: eliminating the catastrophic
> contention on i_mmap_rwsem, and the need for page lock on the pages.
> 
> There is still a need to update locked_vm accounting according to the
> munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
> exit_mmap() does not need locked_vm updates, so delete unlock_range().
> 
> And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
> because of the uncertainty in blocking on all those page locks?
> No fear of that now, so permit the OOM reaper on mlocked vmas.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
  2022-02-06 21:36 ` [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page() Hugh Dickins
@ 2022-02-11 10:29   ` Vlastimil Babka
  2022-02-14  7:05     ` [PATCH v2 " Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 10:29 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:36, Hugh Dickins wrote:
> Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
> inline functions which check (vma->vm_flags & VM_LOCKED) before calling
> mlock_page() and munlock_page() in mm/mlock.c.
> 
> Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
> because we have understandable difficulty in accounting pte maps of THPs,
> and if passed a PageHead page, mlock_page() and munlock_page() cannot
> tell whether it's a pmd map to be counted or a pte map to be ignored.
> 
> Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
> others, and use that to call mlock_vma_page() at the end of the page
> adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
> beginning? unimportant, but end was easier for assertions in testing).
> 
> No page lock is required (although almost all adds happen to hold it):
> delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
> Certainly page lock did serialize with page migration, but I'm having
> difficulty explaining why that was ever important.
> 
> Mlock accounting on THPs has been hard to define, differed between anon
> and file, involved PageDoubleMap in some places and not others, required
> clear_page_mlock() at some points.  Keep it simple now: just count the
> pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
> 
> page_add_new_anon_rmap() callers unchanged: they have long been calling
> lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
> handling (it also checks for not VM_SPECIAL: I think that's overcautious,
> and inconsistent with other checks, that mmap_region() already prevents
> VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Yeah, much better when most of the sites are encapsulated in rmap operations
like this.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
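
To make the shape concrete, a sketch of the inline wrappers as the commit
message describes them (the exact guards in mm/internal.h may differ; the
compound check here is an assumption based on the text above):

	static inline void mlock_vma_page(struct page *page,
					  struct vm_area_struct *vma, bool compound)
	{
		/* Count pmd maps of a THP; ignore pte maps of a compound page */
		if (unlikely(vma->vm_flags & VM_LOCKED) &&
		    (compound || !PageTransCompound(page)))
			mlock_page(page);
	}

	static inline void munlock_vma_page(struct page *page,
					    struct vm_area_struct *vma, bool compound)
	{
		if (unlikely(vma->vm_flags & VM_LOCKED) &&
		    (compound || !PageTransCompound(page)))
			munlock_page(page);
	}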

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance
  2022-02-06 21:38 ` [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance Hugh Dickins
@ 2022-02-11 11:42   ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 11:42 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:38, Hugh Dickins wrote:
> Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
> of the munlocking to clear_page_mlock(), since PageMlocked is typically
> still set when mapcount has fallen to 0.  That is not what we want: we
> want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
> on the integrity of the mlock/munlock protocol - small numbers are
> not surprising, but big numbers mean the protocol is not working.
> 
> That could be easily fixed by placing munlock_vma_page() at the start of
> page_remove_rmap(); but later in the series we shall want to batch the
> munlocking, and that too would tend to leave PageMlocked still set at
> the point when it is checked.
> 
> So delete clear_page_mlock() now: leave it instead to release_pages()
> (and __page_cache_release()) to do this backstop clearing of Mlocked,
> when page refcount has fallen to 0.  If a pinned page occasionally gets
> counted as Mlocked and Unevictable until it is unpinned, that's okay.
> 
> A slightly regrettable side-effect of this change is that, since
> release_pages() and __page_cache_release() may be called at interrupt
> time, those places which update NR_MLOCK with interrupts enabled
> had better use mod_zone_page_state() than __mod_zone_page_state()
> (but holding the lruvec lock always has interrupts disabled).
> 
> This change, forcing Mlocked off when refcount 0 instead of earlier
> when mapcount 0, is not fundamental: it can be reversed if performance
> or something else is found to suffer; but this is the easiest way to
> separate the stats - let's not complicate that without good reason.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
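
To illustrate the stat-update point above with the two existing helpers
(fragments only, not taken from the patch):

	/* Caller may run with interrupts enabled: use the irq-safe form */
	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);

	/* Under lruvec lock, interrupts are already disabled: __ form is fine */
	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);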

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable
  2022-02-06 21:40 ` [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable Hugh Dickins
@ 2022-02-11 12:27   ` Vlastimil Babka
  2022-02-14  5:42     ` Hugh Dickins
  2022-02-11 18:07   ` Vlastimil Babka
  1 sibling, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 12:27 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:40, Hugh Dickins wrote:
> Previous patches have been preparatory: now implement page->mlock_count.
> The ordering of the "Unevictable LRU" is of no significance, and there is
> no point holding unevictable pages on a list: place page->mlock_count to
> overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
> which needs to be even so as not to satisfy PageTail - though 2 could be
> added instead of 1 for each mlock, if that's ever an improvement).
> 
> But it's only safe to rely on or modify page->mlock_count while lruvec
> lock is held and page is on unevictable "LRU" - we can save lots of edits
> by continuing to pretend that there's an imaginary LRU here (there is an
> unevictable count which still needs to be maintained, but not a list).
> 
> The mlock_count technique suffers from an unreliability much like with
> page_mlock(): while someone else has the page off LRU, not much can
> be done.  As before, err on the safe side (behave as if mlock_count 0),
> and let try_to_unmap_one() move the page to unevictable if reclaim finds
> out later on - a few misplaced pages don't matter, what we want to avoid
> is imbalancing reclaim by flooding evictable lists with unevictable pages.
> 
> I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
> if we have taken lruvec lock to get the page off its present list, then
> we save everyone trouble (and however many extra atomic ops) by putting
> it on its destination list immediately.

Good point.

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/mm_inline.h | 11 +++++--
>  include/linux/mm_types.h  | 19 +++++++++--
>  mm/huge_memory.c          |  5 ++-
>  mm/memcontrol.c           |  3 +-
>  mm/mlock.c                | 68 +++++++++++++++++++++++++++++++--------
>  mm/mmzone.c               |  7 ++++
>  mm/swap.c                 |  1 +
>  7 files changed, 92 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index b725839dfe71..884d6f6af05b 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -99,7 +99,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
>  
>  	update_lru_size(lruvec, lru, folio_zonenum(folio),
>  			folio_nr_pages(folio));
> -	list_add(&folio->lru, &lruvec->lists[lru]);
> +	if (lru != LRU_UNEVICTABLE)
> +		list_add(&folio->lru, &lruvec->lists[lru]);
>  }
>  
>  static __always_inline void add_page_to_lru_list(struct page *page,
> @@ -115,6 +116,7 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
>  
>  	update_lru_size(lruvec, lru, folio_zonenum(folio),
>  			folio_nr_pages(folio));
> +	/* This is not expected to be used on LRU_UNEVICTABLE */

Felt uneasy about this at first because it's just a _tail version of
lruvec_add_folio, and there's probably nothing fundamental preventing the
users of _tail from encountering unevictable pages. But if the assumption is
ever violated, the poisoned list head should make it immediately clear, so I
guess that's fine.

>  	list_add_tail(&folio->lru, &lruvec->lists[lru]);
>  }
>  
> @@ -127,8 +129,11 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  static __always_inline
>  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
>  {
> -	list_del(&folio->lru);
> -	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
> +	enum lru_list lru = folio_lru_list(folio);
> +
> +	if (lru != LRU_UNEVICTABLE)
> +		list_del(&folio->lru);
> +	update_lru_size(lruvec, lru, folio_zonenum(folio),
>  			-folio_nr_pages(folio));
>  }
>  
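
For orientation, a sketch of the page->mlock_count overlay that the commit
message describes; the field names and exact nesting in mm_types.h are
illustrative here, not quoted from the patch:

	union {
		struct list_head lru;	/* evictable LRUs still use the list */
		struct {		/* unevictable "LRU": no list needed */
			/*
			 * Overlays lru.next, which shares its slot with
			 * compound_head: must stay even, so the page can
			 * never satisfy PageTail.
			 */
			void *__filler;
			/* Overlays lru.prev */
			unsigned int mlock_count;
		};
	};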

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking
  2022-02-06 21:42 ` [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking Hugh Dickins
@ 2022-02-11 16:45   ` Vlastimil Babka
  2022-02-14  6:32     ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 16:45 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:42, Hugh Dickins wrote:
> Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
> required to lower the mlock_counts when munlocking without munmapping;
> and its complement, implementation of mlock_vma_pages_range(), required
> to raise the mlock_counts on pages already there when a range is mlocked.
> 
> Combine them into just the one function mlock_vma_pages_range(), using
> walk_page_range() to run mlock_pte_range().  This approach fixes the
> "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
> https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/
> 
> Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
> a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
> pte first, it will not try to munlock the page: leaving release_pages()
> to correct it when the last reference to the page is gone - that's okay,
> a page is not evictable anyway while it is held by an extra reference.
> 
> Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
> a racing remove_migration_pte() or try_to_unmap_one() (depending on
> i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
> then mlock_pte_range() will mlock it a second time.  This is harder to
> reproduce, but a more serious race because it could leave the page
> unevictable indefinitely even though the area is munlocked afterwards.
> Guard against it by setting the (inappropriate) VM_IO flag,
> and modifying mlock_vma_page() to decline such vmas.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	pgoff_t pgoff;
>  	int nr_pages;
>  	int ret = 0;
> -	int lock = !!(newflags & VM_LOCKED);
> -	vm_flags_t old_flags = vma->vm_flags;
> +	vm_flags_t oldflags = vma->vm_flags;
>  
>  	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||

Nit: can use oldflags instead of vma->vm_flags above?
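
As a rough sketch of the shape the commit message describes - one pte walk
serving both mlock and munlock according to VM_LOCKED - simplified to the
pte-mapped case (the real mlock_pte_range() also handles THP pmds and holes):

	static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
				   unsigned long end, struct mm_walk *walk)
	{
		struct vm_area_struct *vma = walk->vma;
		pte_t *start_pte, *pte;
		spinlock_t *ptl;

		/* Sketch: assumes a present, pte-mapped pmd */
		start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
		for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
			struct page *page;

			if (!pte_present(*pte))
				continue;
			page = vm_normal_page(vma, addr, *pte);
			if (!page || PageTransCompound(page))
				continue;
			if (vma->vm_flags & VM_LOCKED)
				mlock_vma_page(page, vma, false);
			else
				munlock_vma_page(page, vma, false);
		}
		pte_unmap_unlock(start_pte, ptl);
		cond_resched();
		return 0;
	}

	static const struct mm_walk_ops mlock_walk_ops = {
		.pmd_entry = mlock_pte_range,
	};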


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU
  2022-02-06 21:43 ` [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU Hugh Dickins
@ 2022-02-11 17:19   ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 17:19 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:43, Hugh Dickins wrote:
> Compaction, NUMA page movement, THP collapse/split, and memory failure
> do isolate unevictable pages from their "LRU", losing the record of
> mlock_count in doing so (isolators are likely to use page->lru for their
> own private lists, so mlock_count has to be presumed lost).
> 
> That's unfortunate, and we should put in some work to correct that: one
> can imagine a function to build up the mlock_count again - but it would
> require i_mmap_rwsem for read, so be careful where it's called.  Or
> page_referenced_one() and try_to_unmap_one() might do that extra work.
> 
> But one place that can very easily be improved is page migration's
> __unmap_and_move(): a small adjustment to where the successful new page
> is put back on LRU, and its mlock_count (if any) is built back up by
> remove_migration_ptes().
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/migrate.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7c4223ce2500..f4bcf1541b62 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1032,6 +1032,21 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	if (!page_mapped(page))
>  		rc = move_to_new_page(newpage, page, mode);
>  
> +	/*
> +	 * When successful, push newpage to LRU immediately: so that if it
> +	 * turns out to be an mlocked page, remove_migration_ptes() will
> +	 * automatically build up the correct newpage->mlock_count for it.
> +	 *
> +	 * We would like to do something similar for the old page, when
> +	 * unsuccessful, and other cases when a page has been temporarily
> +	 * isolated from the unevictable LRU: but this case is the easiest.
> +	 */
> +	if (rc == MIGRATEPAGE_SUCCESS) {
> +		lru_cache_add(newpage);
> +		if (page_was_mapped)
> +			lru_add_drain();
> +	}
> +
>  	if (page_was_mapped)
>  		remove_migration_ptes(page,
>  			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
> @@ -1045,20 +1060,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	unlock_page(page);
>  out:
>  	/*
> -	 * If migration is successful, decrease refcount of the newpage
> +	 * If migration is successful, decrease refcount of the newpage,
>  	 * which will not free the page because new page owner increased
> -	 * refcounter. As well, if it is LRU page, add the page to LRU
> -	 * list in here. Use the old state of the isolated source page to
> -	 * determine if we migrated a LRU page. newpage was already unlocked
> -	 * and possibly modified by its owner - don't rely on the page
> -	 * state.
> +	 * refcounter.
>  	 */
> -	if (rc == MIGRATEPAGE_SUCCESS) {
> -		if (unlikely(!is_lru))
> -			put_page(newpage);
> -		else
> -			putback_lru_page(newpage);
> -	}
> +	if (rc == MIGRATEPAGE_SUCCESS)
> +		put_page(newpage);
>  
>  	return rc;
>  }


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
  2022-02-06 21:45 ` [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn() Hugh Dickins
@ 2022-02-11 17:43   ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 17:43 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:45, Hugh Dickins wrote:
> My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
> says that it can now be deleted; and that remains so when the next patch
> is added.

Agree with moderate certainty.

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/swap.c | 37 +++++++++----------------------------
>  1 file changed, 9 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 682a03301a2c..3f770b1ea2c1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
>  
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  
> +	folio_set_lru(folio);
>  	/*
> -	 * A folio becomes evictable in two ways:
> -	 * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
> -	 * 2) Before acquiring LRU lock to put the folio on the correct LRU
> -	 *    and then
> -	 *   a) do PageLRU check with lock [check_move_unevictable_pages]
> -	 *   b) do PageLRU check before lock [clear_page_mlock]
> -	 *
> -	 * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
> -	 * following strict ordering:
> -	 *
> -	 * #0: __pagevec_lru_add_fn		#1: clear_page_mlock
> -	 *
> -	 * folio_set_lru()			folio_test_clear_mlocked()
> -	 * smp_mb() // explicit ordering	// above provides strict
> -	 *					// ordering
> -	 * folio_test_mlocked()			folio_test_lru()
> +	 * Is an smp_mb__after_atomic() still required here, before
> +	 * folio_evictable() tests PageMlocked, to rule out the possibility
> +	 * of stranding an evictable folio on an unevictable LRU?  I think
> +	 * not, because munlock_page() only clears PageMlocked while the LRU
> +	 * lock is held.
>  	 *
> -	 *
> -	 * if '#1' does not observe setting of PG_lru by '#0' and
> -	 * fails isolation, the explicit barrier will make sure that
> -	 * folio_evictable check will put the folio on the correct
> -	 * LRU. Without smp_mb(), folio_set_lru() can be reordered
> -	 * after folio_test_mlocked() check and can make '#1' fail the
> -	 * isolation of the folio whose mlocked bit is cleared (#0 is
> -	 * also looking at the same folio) and the evictable folio will
> -	 * be stranded on an unevictable LRU.
> +	 * (That is not true of __page_cache_release(), and not necessarily
> +	 * true of release_pages(): but those only clear PageMlocked after
> +	 * put_page_testzero() has excluded any other users of the page.)
>  	 */
> -	folio_set_lru(folio);
> -	smp_mb__after_atomic();
> -
>  	if (folio_evictable(folio)) {
>  		if (was_unevictable)
>  			__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable
  2022-02-06 21:40 ` [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable Hugh Dickins
  2022-02-11 12:27   ` Vlastimil Babka
@ 2022-02-11 18:07   ` Vlastimil Babka
  2022-02-14  6:28     ` Hugh Dickins
  1 sibling, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 18:07 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:40, Hugh Dickins wrote:
> @@ -72,19 +91,40 @@ void mlock_page(struct page *page)
>   */
>  void munlock_page(struct page *page)
>  {
> +	struct lruvec *lruvec;
> +	int nr_pages = thp_nr_pages(page);
> +
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
> +	lock_page_memcg(page);

Hm this (and unlock_page_memcg() below) didn't catch my attention until I
saw that patch 10/13 removes it again. It also AFAICS wasn't present in the
code removed by patch 1. Am I missing something, or was it not necessary to
add it in the first place?

> +	lruvec = folio_lruvec_lock_irq(page_folio(page));
> +	if (PageLRU(page) && PageUnevictable(page)) {
> +		/* Then mlock_count is maintained, but might undercount */
> +		if (page->mlock_count)
> +			page->mlock_count--;
> +		if (page->mlock_count)
> +			goto out;
> +	}
> +	/* else assume that was the last mlock: reclaim will fix it if not */
> +
>  	if (TestClearPageMlocked(page)) {
> -		int nr_pages = thp_nr_pages(page);
> -
> -		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> -		if (!isolate_lru_page(page)) {
> -			putback_lru_page(page);
> -			count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> -		} else if (PageUnevictable(page)) {
> -			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
> -		}
> +		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> +		if (PageLRU(page) || !PageUnevictable(page))
> +			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> +		else
> +			__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
> +	}
> +
> +	/* page_evictable() has to be checked *after* clearing Mlocked */
> +	if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
> +		del_page_from_lru_list(page, lruvec);
> +		ClearPageUnevictable(page);
> +		add_page_to_lru_list(page, lruvec);
> +		__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
>  	}
> +out:
> +	unlock_page_lruvec_irq(lruvec);
> +	unlock_page_memcg(page);
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec
  2022-02-06 21:47 ` [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec Hugh Dickins
  2022-02-09  8:15   ` Geert Uytterhoeven
@ 2022-02-11 18:26   ` Vlastimil Babka
  2022-02-14  7:15     ` [PATCH v2 " Hugh Dickins
  1 sibling, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 18:26 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:47, Hugh Dickins wrote:
> A weakness of the page->mlock_count approach is the need for lruvec lock
> while holding page table lock.  That is not an overhead we would allow on
> normal pages, but I think acceptable just for pages in an mlocked area.
> But let's try to amortize the extra cost by gathering on per-cpu pagevec
> before acquiring the lruvec lock.
> 
> I have an unverified conjecture that the mlock pagevec might work out
> well for delaying the mlock processing of new file pages until they have
> got off lru_cache_add()'s pagevec and on to LRU.
> 
> The initialization of page->mlock_count is subject to races and awkward:
> 0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
> this commit, which just widens the window?  I haven't gone back to think
> it through.  Maybe someone can point out a better way to initialize it.

Not me, it seems.

> Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
> into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
> rather than lru_cache_add()'s pagevec.
> 
> Experimented with various orderings: the right thing seems to be for
> mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
> pagevec, but munlock_page() to leave TestClearPageMlocked to the later
> pagevec processing.
> 
> Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
> their point, and the thp_nr_pages()s already contain a VM_BUG_ON_PGFLAGS()
> for that.
> 
> This still leaves acquiring lruvec locks under page table lock each time
> the pagevec fills (or a THP is added): which I suppose is rather silly,
> since they sit on pagevec waiting to be processed long after page table
> lock has been dropped; but I'm disinclined to uglify the calling sequence
> until some load shows an actual problem with it (nothing wrong with
> taking lruvec lock under page table lock, just "nicer" to do it less).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
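
A sketch of where the batching amortizes the lruvec lock, per the commit
message above; mlock_pagevec() stands in for the patch's batch-processing
helper, whose name and flag encoding are assumptions here:

	static DEFINE_PER_CPU(struct pagevec, mlock_pvec);

	void mlock_page(struct page *page)
	{
		struct pagevec *pvec = &get_cpu_var(mlock_pvec);

		if (!TestSetPageMlocked(page)) {
			int nr_pages = thp_nr_pages(page);

			mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
			__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
		}

		get_page(page);
		if (!pagevec_add(pvec, page) || PageHead(page))
			mlock_pagevec(pvec);	/* lruvec lock taken once per batch */
		put_cpu_var(mlock_pvec);
	}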

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained
  2022-02-06 21:49 ` [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained Hugh Dickins
@ 2022-02-11 18:49   ` Vlastimil Babka
  2022-02-14  5:34     ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-11 18:49 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/6/22 22:49, Hugh Dickins wrote:
> Page migration of a VM_LOCKED page tends to fail, because when the old
> page is unmapped, it is put on the mlock pagevec with raised refcount,
> which then fails the freeze.
> 
> At first I thought this would be fixed by a local mlock_page_drain() at
> the upper rmap_walk() level - which would have nicely batched all the
> munlocks of that page; but tests show that the task can too easily move
> to another cpu, leaving pagevec residue behind which fails the migration.
> 
> So try_to_migrate_one() drain the local pagevec after page_remove_rmap()
> from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
> TTU_IGNORE_MLOCK users would want the same treatment; and do the same
> in remove_migration_pte() - not important when successfully inserting
> a new page, but necessary when hoping to retry after failure.
> 
> Any new pagevec runs the risk of adding a new way of stranding, and we
> might discover other corners where mlock_page_drain() or lru_add_drain()
> would now help.  If the mlock pagevec raises doubts, we can easily add a
> sysctl to tune its length to 1, which reverts to synchronous operation.

Not a fan of adding new sysctls like those as that just pushes the failure
of kernel devs to poor admins :)
The old pagevec usage deleted by patch 1 was limited to the naturally larger
munlock_vma_pages_range() operation. The new per-cpu based one is more
general, which obviously has its advantages, but then it might bring new
corner cases.
So if this turns out to be a big problem, I would rather go back to the
limited scenario pagevec than a sysctl?

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/migrate.c | 2 ++
>  mm/rmap.c    | 4 ++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f4bcf1541b62..e7d0b68d5dcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  				page_add_file_rmap(new, vma, false);
>  			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
>  		}
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_page_drain(smp_processor_id());
>  
>  		/* No need to invalidate - it was non-present before */
>  		update_mmu_cache(vma, pvmw.address, pvmw.pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5442a5c97a85..714bfdc72c7b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		 * See Documentation/vm/mmu_notifier.rst
>  		 */
>  		page_remove_rmap(subpage, vma, PageHuge(page));
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_page_drain(smp_processor_id());
>  		put_page(page);
>  	}
>  
> @@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>  		 * See Documentation/vm/mmu_notifier.rst
>  		 */
>  		page_remove_rmap(subpage, vma, PageHuge(page));
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_page_drain(smp_processor_id());
>  		put_page(page);
>  	}
>  


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained
  2022-02-11 18:49   ` Vlastimil Babka
@ 2022-02-14  5:34     ` Hugh Dickins
  2022-02-14  7:17       ` [PATCH v2 " Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  5:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:49, Hugh Dickins wrote:
> > 
> > Any new pagevec runs the risk of adding a new way of stranding, and we
> > might discover other corners where mlock_page_drain() or lru_add_drain()
> > would now help.  If the mlock pagevec raises doubts, we can easily add a
> > sysctl to tune its length to 1, which reverts to synchronous operation.
> 
> Not a fan of adding new sysctls like those as that just pushes the failure
> of kernel devs to poor admins :)
> The old pagevec usage deleted by patch 1 was limited to the naturally larger
> munlock_vma_pages_range() operation. The new per-cpu based one is more
> general, which obviously has its advantages, but then it might bring new
> corner cases.
> So if this turns out to be a big problem, I would rather go back to the
> limited scenario pagevec than a sysctl?

Okay, I'll delete that comment proposing a sysctl, which was more as
a possible safety measure for our internal experimentation than for
general use.  I just thought it was an easy way to force synchronous.

I don't expect a big problem.  The flush in this commit deals, I think,
with the only way it was treading on its own toes, using a pagevec at
a level which relied on the unlikelihood of a pagevec reference.

But I can imagine that such-and-such a test will expect Mlocked or
Unevictable to be exact, and this pagevec now delays it becoming exact.
I've not seen any example so far, but it does seem possible.  Maybe
we shall fix the test, maybe we shall add a drain.

I hadn't thought of limiting the use of pagevec to the mass operation
(if a significant problem emerges).  Yes, that might be a better idea,
thanks.  Anyway, we don't need to worry in advance.

A change I also had in this patch, originally, was for
/proc/sys/vm/stat_refresh to lru_add_drain_all() first.  I'm surprised
that tests see good numbers without it doing so; but since I've not
actually seen the need for it yet, dropped that - we can always add
it later if need emerges.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable
  2022-02-11 12:27   ` Vlastimil Babka
@ 2022-02-14  5:42     ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  5:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:40, Hugh Dickins wrote:
> > @@ -115,6 +116,7 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> >  
> >  	update_lru_size(lruvec, lru, folio_zonenum(folio),
> >  			folio_nr_pages(folio));
> > +	/* This is not expected to be used on LRU_UNEVICTABLE */
> 
> Felt uneasy about this at first because it's just a _tail version of
> lruvec_add_folio, and there's probably nothing fundamental about the users
> of _tail to not encounter unevictable pages. But if the assumption is ever
> violated, the poisoned list head should make it immediately clear, so I
> guess that's fine.

Yes, I could have made that one check against LRU_UNEVICTABLE too, but
thought we would rather see the crash on the poisoned list head: since
specifically choosing the tail of an unordered (and imaginary) list
raises questions - it might turn out to be best permitted, it might turn
out to require a rethink; but until there is a case, let's crash on it.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable
  2022-02-11 18:07   ` Vlastimil Babka
@ 2022-02-14  6:28     ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  6:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:40, Hugh Dickins wrote:
> > @@ -72,19 +91,40 @@ void mlock_page(struct page *page)
> >   */
> >  void munlock_page(struct page *page)
> >  {
> > +	struct lruvec *lruvec;
> > +	int nr_pages = thp_nr_pages(page);
> > +
> >  	VM_BUG_ON_PAGE(PageTail(page), page);
> >  
> > +	lock_page_memcg(page);
> 
> Hm this (and unlock_page_memcg() below) didn't catch my attention until I
> see patch 10/13 removes it again. It also AFAICS wasn't present in the code
> removed by patch 1. Am I missing something or it wasn't necessary to add it
> in the first place?

Something is needed to stabilize page->memcg, whoops I'm way out of date,
folio->memcg_data, before trying to get the lock on the relevant lruvec.

In commit_charge(), Johannes gives us a choice between four tools:
	 * - the page lock
	 * - LRU isolation
	 * - lock_page_memcg()
	 * - exclusive reference

The original code was using TestClearPageLRU inside isolate_lru_page()
to do it (also happened to have the page lock, but one tool is enough).

But I chose to use lock_page_memcg() at this stage, because we want to
do the TestClearPageMlocked part of the business even when !PageLRU;
and I don't entirely love the TestClearPageLRU method, since one will
fail if two try it concurrently.

Later, when doing the pagevec implementation, it seemed to become
more natural to use the TestClearPageLRU method - because that's how
pagevec_lru_move_fn() does it, or did I have a stronger reason for
making a different choice at that stage?  Maybe: I might have been
trying to keep the different functions as similar as possible.

But really we have too many ways to do that same thing, and my
choices may have been arbitrary, according to mood.  (When Alex Shi
popularized the TestClearPageLRU method, I did devise a patch which
used the lock_page_memcg() method throughout instead; but it was not
obviously better, so I didn't waste anyone else's time with it.)
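
For readers following along, the two idioms being compared look roughly
like this (illustrative fragments, not taken from any one patch):

	/* lock_page_memcg(): works whether or not the page is on an LRU */
	lock_page_memcg(page);
	lruvec = folio_lruvec_lock_irq(page_folio(page));
	/* ... flag and mlock_count work under lruvec lock ... */
	unlock_page_lruvec_irq(lruvec);
	unlock_page_memcg(page);

	/* TestClearPageLRU: isolation, which only one contender can win */
	if (TestClearPageLRU(page)) {
		lruvec = folio_lruvec_lock_irq(page_folio(page));
		/* ... */
		SetPageLRU(page);
		unlock_page_lruvec_irq(lruvec);
	}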

I'm afraid that looking here again has led me to wonder whether
__munlock_page() in the final (10/13 pagevec) implementation is correct
to be using __operations in its !isolated case.  But I'll have to come
back and think about that another time, must push forward tonight.

Hugh

> 
> > +	lruvec = folio_lruvec_lock_irq(page_folio(page));
> > +	if (PageLRU(page) && PageUnevictable(page)) {
> > +		/* Then mlock_count is maintained, but might undercount */
> > +		if (page->mlock_count)
> > +			page->mlock_count--;
> > +		if (page->mlock_count)
> > +			goto out;
> > +	}
> > +	/* else assume that was the last mlock: reclaim will fix it if not */
> > +
> >  	if (TestClearPageMlocked(page)) {
> > -		int nr_pages = thp_nr_pages(page);
> > -
> > -		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> > -		if (!isolate_lru_page(page)) {
> > -			putback_lru_page(page);
> > -			count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> > -		} else if (PageUnevictable(page)) {
> > -			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
> > -		}
> > +		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> > +		if (PageLRU(page) || !PageUnevictable(page))
> > +			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> > +		else
> > +			__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
> > +	}
> > +
> > +	/* page_evictable() has to be checked *after* clearing Mlocked */
> > +	if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
> > +		del_page_from_lru_list(page, lruvec);
> > +		ClearPageUnevictable(page);
> > +		add_page_to_lru_list(page, lruvec);
> > +		__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
> >  	}
> > +out:
> > +	unlock_page_lruvec_irq(lruvec);
> > +	unlock_page_memcg(page);
> >  }
> >  
> >  /*

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking
  2022-02-11 16:45   ` Vlastimil Babka
@ 2022-02-14  6:32     ` Hugh Dickins
  2022-02-14  7:09       ` [PATCH v2 " Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  6:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:42, Hugh Dickins wrote:
> > @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  	pgoff_t pgoff;
> >  	int nr_pages;
> >  	int ret = 0;
> > -	int lock = !!(newflags & VM_LOCKED);
> > -	vm_flags_t old_flags = vma->vm_flags;
> > +	vm_flags_t oldflags = vma->vm_flags;
> >  
> >  	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> 
> Nit: can use oldflags instead of vma->vm_flags above?

Yes thanks, that is nicer, I'm making that change now.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-10  9:52       ` Vlastimil Babka
@ 2022-02-14  6:52         ` Hugh Dickins
  2022-02-14  6:59           ` [PATCH v2 " Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  6:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

On Thu, 10 Feb 2022, Vlastimil Babka wrote:
> On 2/9/22 23:28, Hugh Dickins wrote:
> > 
> > Though taking this apart and putting it back together brings its own
> > dangers.  That second suggestion probably won't fly very well, with
> > 06/13 using mlock_count only while on the unevictable.  I'm not sure
> > how much rethinking the bisection possibility deserves.
> 
> Right, if it's impractical to change for a potential and hopefully unlikely
> bad bisection luck, just a note at the end of each patch's changelog
> mentioning what temporarily doesn't work, should be enough.

I'm adding a paragraph on that to the end of this 01/13's changelog.
If you or akpm think it's better duplicated in 02, 03, 04, 05, 06
then I think it will be easiest if Andrew edits it into them, rather
than me updating them one by one.

> > As to VM_LOCKED: yes, you're right, at this stage of the series the
> > munlock really ought to be clearing VM_LOCKED (even while it doesn't
> > go on to do anything about the pages), as it claims in the comment above.
> > I removed this line at a later stage (07/13), when changing it to
> > mlock_vma_pages_range() serving both mlock and munlock according to
> > whether VM_LOCKED is provided - and mistakenly folded back that deletion
> > to this patch.  End result the same, but better to restore that maskout
> > in this patch, as you suggest.
> 
> Great, thanks. That restores any effect on VM_LOCKONFAULT in any case as well.

Yes, it turned out to be a mistake in my rebasing the series, the original
had the VM_LOCKED_CLEAR_MASK masking still there (up until the interface
changes in 07/13).  Thanks for spotting that.

Vlastimil, many thanks for your valiant effort reviewing this series:
it's not at all easy, and I much appreciate the time you've put into it.

I'm now going to send out v2 updates to 01, 04, 07, 10, 11 (only),
still based on 5.17-rc2, but incorporating your and others' suggestions.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-14  6:52         ` Hugh Dickins
@ 2022-02-14  6:59           ` Hugh Dickins
  2022-02-14 10:07             ` Vlastimil Babka
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  6:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

We have recommended some applications to mlock their userspace, but that
turns out to be counter-productive: when many processes mlock the same
file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
is needed for write, to remove any vma mapping that file from rmap's tree;
but hogged for read by those with mlocks calling page_mlock() (formerly
known as try_to_munlock()) on *each* page mapped from the file (the
purpose being to find out whether another process has the page mlocked,
so therefore it should not be unmlocked yet).

Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.

A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant.  Let's try to reuse its page->lru.

But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder.  Don't mind those old comments about THPs, they
date from before 4.5's refcounting rework: splitting is not a risk here.

Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().

After this and each following commit, the kernel builds, boots and runs;
but with deficiencies which may show up in testing of mlock and munlock.
The system calls succeed or fail as before, and mlock remains effective
in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
may be shown too low after mlock, grow, then stay too high after munlock:
with previously mlocked pages remaining unevictable for too long, until
finally unmapped and freed and counts corrected. Normal service will be
resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

Signed-off-by: Hugh Dickins <hughd@google.com>
---
v2: Restore munlock_vma_pages_range() VM_LOCKED_CLEAR_MASK, per Vlastimil.
    Add paragraph to changelog on what does not work yet, per Vlastimil.
    These v2 replacement patches based on 5.17-rc2 like v1.

 include/linux/rmap.h |   6 -
 mm/internal.h        |   2 +-
 mm/mlock.c           | 375 +++----------------------------------------
 mm/rmap.c            |  80 ---------
 4 files changed, 25 insertions(+), 438 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e704b1a4c06c..dc48aa8c2c94 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
  */
 int folio_mkclean(struct folio *);
 
-/*
- * called in munlock()/munmap() path to check for other vmas holding
- * the page mlocked.
- */
-void page_mlock(struct page *page);
-
 void remove_migration_ptes(struct page *old, struct page *new, bool locked);
 
 /*
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..e48c486d5ddf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -409,7 +409,7 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
  * must be called with vma's mmap_lock held for read or write, and page locked.
  */
 extern void mlock_vma_page(struct page *page);
-extern unsigned int munlock_vma_page(struct page *page);
+extern void munlock_vma_page(struct page *page);
 
 extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
 			      unsigned long len);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8f584eddd305..aec4ce7919da 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -46,12 +46,6 @@ EXPORT_SYMBOL(can_do_mlock);
  * be placed on the LRU "unevictable" list, rather than the [in]active lists.
  * The unevictable list is an LRU sibling list to the [in]active lists.
  * PageUnevictable is set to indicate the unevictable state.
- *
- * When lazy mlocking via vmscan, it is important to ensure that the
- * vma's VM_LOCKED status is not concurrently being modified, otherwise we
- * may have mlocked a page that is being munlocked. So lazy mlock must take
- * the mmap_lock for read, and verify that the vma really is locked
- * (see mm/rmap.c).
  */
 
 /*
@@ -106,299 +100,28 @@ void mlock_vma_page(struct page *page)
 	}
 }
 
-/*
- * Finish munlock after successful page isolation
- *
- * Page must be locked. This is a wrapper for page_mlock()
- * and putback_lru_page() with munlock accounting.
- */
-static void __munlock_isolated_page(struct page *page)
-{
-	/*
-	 * Optimization: if the page was mapped just once, that's our mapping
-	 * and we don't need to check all the other vmas.
-	 */
-	if (page_mapcount(page) > 1)
-		page_mlock(page);
-
-	/* Did try_to_unlock() succeed or punt? */
-	if (!PageMlocked(page))
-		count_vm_events(UNEVICTABLE_PGMUNLOCKED, thp_nr_pages(page));
-
-	putback_lru_page(page);
-}
-
-/*
- * Accounting for page isolation fail during munlock
- *
- * Performs accounting when page isolation fails in munlock. There is nothing
- * else to do because it means some other task has already removed the page
- * from the LRU. putback_lru_page() will take care of removing the page from
- * the unevictable list, if necessary. vmscan [page_referenced()] will move
- * the page back to the unevictable list if some other vma has it mlocked.
- */
-static void __munlock_isolation_failed(struct page *page)
-{
-	int nr_pages = thp_nr_pages(page);
-
-	if (PageUnevictable(page))
-		__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
-	else
-		__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
-}
-
 /**
  * munlock_vma_page - munlock a vma page
  * @page: page to be unlocked, either a normal page or THP page head
- *
- * returns the size of the page as a page mask (0 for normal page,
- *         HPAGE_PMD_NR - 1 for THP head page)
- *
- * called from munlock()/munmap() path with page supposedly on the LRU.
- * When we munlock a page, because the vma where we found the page is being
- * munlock()ed or munmap()ed, we want to check whether other vmas hold the
- * page locked so that we can leave it on the unevictable lru list and not
- * bother vmscan with it.  However, to walk the page's rmap list in
- * page_mlock() we must isolate the page from the LRU.  If some other
- * task has removed the page from the LRU, we won't be able to do that.
- * So we clear the PageMlocked as we might not get another chance.  If we
- * can't isolate the page, we leave it for putback_lru_page() and vmscan
- * [page_referenced()/try_to_unmap()] to deal with.
  */
-unsigned int munlock_vma_page(struct page *page)
+void munlock_vma_page(struct page *page)
 {
-	int nr_pages;
-
-	/* For page_mlock() and to serialize with page migration */
+	/* Serialize with page migration */
 	BUG_ON(!PageLocked(page));
-	VM_BUG_ON_PAGE(PageTail(page), page);
-
-	if (!TestClearPageMlocked(page)) {
-		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		return 0;
-	}
-
-	nr_pages = thp_nr_pages(page);
-	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-
-	if (!isolate_lru_page(page))
-		__munlock_isolated_page(page);
-	else
-		__munlock_isolation_failed(page);
-
-	return nr_pages - 1;
-}
-
-/*
- * convert get_user_pages() return value to posix mlock() error
- */
-static int __mlock_posix_error_return(long retval)
-{
-	if (retval == -EFAULT)
-		retval = -ENOMEM;
-	else if (retval == -ENOMEM)
-		retval = -EAGAIN;
-	return retval;
-}
-
-/*
- * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec()
- *
- * The fast path is available only for evictable pages with single mapping.
- * Then we can bypass the per-cpu pvec and get better performance.
- * when mapcount > 1 we need page_mlock() which can fail.
- * when !page_evictable(), we need the full redo logic of putback_lru_page to
- * avoid leaving evictable page in unevictable list.
- *
- * In case of success, @page is added to @pvec and @pgrescued is incremented
- * in case that the page was previously unevictable. @page is also unlocked.
- */
-static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec,
-		int *pgrescued)
-{
-	VM_BUG_ON_PAGE(PageLRU(page), page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	if (page_mapcount(page) <= 1 && page_evictable(page)) {
-		pagevec_add(pvec, page);
-		if (TestClearPageUnevictable(page))
-			(*pgrescued)++;
-		unlock_page(page);
-		return true;
-	}
-
-	return false;
-}
 
-/*
- * Putback multiple evictable pages to the LRU
- *
- * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of
- * the pages might have meanwhile become unevictable but that is OK.
- */
-static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
-{
-	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
-	/*
-	 *__pagevec_lru_add() calls release_pages() so we don't call
-	 * put_page() explicitly
-	 */
-	__pagevec_lru_add(pvec);
-	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
-}
-
-/*
- * Munlock a batch of pages from the same zone
- *
- * The work is split to two main phases. First phase clears the Mlocked flag
- * and attempts to isolate the pages, all under a single zone lru lock.
- * The second phase finishes the munlock only for pages where isolation
- * succeeded.
- *
- * Note that the pagevec may be modified during the process.
- */
-static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
-{
-	int i;
-	int nr = pagevec_count(pvec);
-	int delta_munlocked = -nr;
-	struct pagevec pvec_putback;
-	struct lruvec *lruvec = NULL;
-	int pgrescued = 0;
-
-	pagevec_init(&pvec_putback);
-
-	/* Phase 1: page isolation */
-	for (i = 0; i < nr; i++) {
-		struct page *page = pvec->pages[i];
-		struct folio *folio = page_folio(page);
-
-		if (TestClearPageMlocked(page)) {
-			/*
-			 * We already have pin from follow_page_mask()
-			 * so we can spare the get_page() here.
-			 */
-			if (TestClearPageLRU(page)) {
-				lruvec = folio_lruvec_relock_irq(folio, lruvec);
-				del_page_from_lru_list(page, lruvec);
-				continue;
-			} else
-				__munlock_isolation_failed(page);
-		} else {
-			delta_munlocked++;
-		}
+	VM_BUG_ON_PAGE(PageTail(page), page);
 
-		/*
-		 * We won't be munlocking this page in the next phase
-		 * but we still need to release the follow_page_mask()
-		 * pin. We cannot do it under lru_lock however. If it's
-		 * the last pin, __page_cache_release() would deadlock.
-		 */
-		pagevec_add(&pvec_putback, pvec->pages[i]);
-		pvec->pages[i] = NULL;
-	}
-	if (lruvec) {
-		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-		unlock_page_lruvec_irq(lruvec);
-	} else if (delta_munlocked) {
-		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	}
+	if (TestClearPageMlocked(page)) {
+		int nr_pages = thp_nr_pages(page);
 
-	/* Now we can release pins of pages that we are not munlocking */
-	pagevec_release(&pvec_putback);
-
-	/* Phase 2: page munlock */
-	for (i = 0; i < nr; i++) {
-		struct page *page = pvec->pages[i];
-
-		if (page) {
-			lock_page(page);
-			if (!__putback_lru_fast_prepare(page, &pvec_putback,
-					&pgrescued)) {
-				/*
-				 * Slow path. We don't want to lose the last
-				 * pin before unlock_page()
-				 */
-				get_page(page); /* for putback_lru_page() */
-				__munlock_isolated_page(page);
-				unlock_page(page);
-				put_page(page); /* from follow_page_mask() */
-			}
+		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+		if (!isolate_lru_page(page)) {
+			putback_lru_page(page);
+			count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
+		} else if (PageUnevictable(page)) {
+			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 		}
 	}
-
-	/*
-	 * Phase 3: page putback for pages that qualified for the fast path
-	 * This will also call put_page() to return pin from follow_page_mask()
-	 */
-	if (pagevec_count(&pvec_putback))
-		__putback_lru_fast(&pvec_putback, pgrescued);
-}
-
-/*
- * Fill up pagevec for __munlock_pagevec using pte walk
- *
- * The function expects that the struct page corresponding to @start address is
- * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
- *
- * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
- * pages also get pinned.
- *
- * Returns the address of the next page that should be scanned. This equals
- * @start + PAGE_SIZE when no page could be added by the pte walk.
- */
-static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
-			struct vm_area_struct *vma, struct zone *zone,
-			unsigned long start, unsigned long end)
-{
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	/*
-	 * Initialize pte walk starting at the already pinned page where we
-	 * are sure that there is a pte, as it was pinned under the same
-	 * mmap_lock write op.
-	 */
-	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
-	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
-
-	/* The page next to the pinned page is the first we will try to get */
-	start += PAGE_SIZE;
-	while (start < end) {
-		struct page *page = NULL;
-		pte++;
-		if (pte_present(*pte))
-			page = vm_normal_page(vma, start, *pte);
-		/*
-		 * Break if page could not be obtained or the page's node+zone does not
-		 * match
-		 */
-		if (!page || page_zone(page) != zone)
-			break;
-
-		/*
-		 * Do not use pagevec for PTE-mapped THP,
-		 * munlock_vma_pages_range() will handle them.
-		 */
-		if (PageTransCompound(page))
-			break;
-
-		get_page(page);
-		/*
-		 * Increase the address that will be returned *before* the
-		 * eventual break due to pvec becoming full by adding the page
-		 */
-		start += PAGE_SIZE;
-		if (pagevec_add(pvec, page) == 0)
-			break;
-	}
-	pte_unmap_unlock(pte, ptl);
-	return start;
 }
 
 /*
@@ -413,75 +136,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
  *
  * Returns with VM_LOCKED cleared.  Callers must be prepared to
  * deal with this.
- *
- * We don't save and restore VM_LOCKED here because pages are
- * still on lru.  In unmap path, pages might be scanned by reclaim
- * and re-mlocked by page_mlock/try_to_unmap before we unmap and
- * free them.  This will result in freeing mlocked pages.
  */
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 
-	while (start < end) {
-		struct page *page;
-		unsigned int page_mask = 0;
-		unsigned long page_increm;
-		struct pagevec pvec;
-		struct zone *zone;
-
-		pagevec_init(&pvec);
-		/*
-		 * Although FOLL_DUMP is intended for get_dump_page(),
-		 * it just so happens that its special treatment of the
-		 * ZERO_PAGE (returning an error instead of doing get_page)
-		 * suits munlock very well (and if somehow an abnormal page
-		 * has sneaked into the range, we won't oops here: great).
-		 */
-		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
-
-		if (page && !IS_ERR(page)) {
-			if (PageTransTail(page)) {
-				VM_BUG_ON_PAGE(PageMlocked(page), page);
-				put_page(page); /* follow_page_mask() */
-			} else if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to compute
-				 * the page_mask here instead.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-
-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zone, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
-		}
-		page_increm = 1 + page_mask;
-		start += page_increm * PAGE_SIZE;
-next:
-		cond_resched();
-	}
+	/* Reimplementation to follow in later commit */
 }
 
 /*
@@ -645,6 +306,18 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
 	return count >> PAGE_SHIFT;
 }
 
+/*
+ * convert get_user_pages() return value to posix mlock() error
+ */
+static int __mlock_posix_error_return(long retval)
+{
+	if (retval == -EFAULT)
+		retval = -ENOMEM;
+	else if (retval == -ENOMEM)
+		retval = -EAGAIN;
+	return retval;
+}
+
 static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
 {
 	unsigned long locked;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..7ce7f1946cff 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1996,76 +1996,6 @@ void try_to_migrate(struct page *page, enum ttu_flags flags)
 		rmap_walk(page, &rwc);
 }
 
-/*
- * Walks the vma's mapping a page and mlocks the page if any locked vma's are
- * found. Once one is found the page is locked and the scan can be terminated.
- */
-static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address, void *unused)
-{
-	struct page_vma_mapped_walk pvmw = {
-		.page = page,
-		.vma = vma,
-		.address = address,
-	};
-
-	/* An un-locked vma doesn't have any pages to lock, continue the scan */
-	if (!(vma->vm_flags & VM_LOCKED))
-		return true;
-
-	while (page_vma_mapped_walk(&pvmw)) {
-		/*
-		 * Need to recheck under the ptl to serialise with
-		 * __munlock_pagevec_fill() after VM_LOCKED is cleared in
-		 * munlock_vma_pages_range().
-		 */
-		if (vma->vm_flags & VM_LOCKED) {
-			/*
-			 * PTE-mapped THP are never marked as mlocked; but
-			 * this function is never called on a DoubleMap THP,
-			 * nor on an Anon THP (which may still be PTE-mapped
-			 * after DoubleMap was cleared).
-			 */
-			mlock_vma_page(page);
-			/*
-			 * No need to scan further once the page is marked
-			 * as mlocked.
-			 */
-			page_vma_mapped_walk_done(&pvmw);
-			return false;
-		}
-	}
-
-	return true;
-}
-
-/**
- * page_mlock - try to mlock a page
- * @page: the page to be mlocked
- *
- * Called from munlock code. Checks all of the VMAs mapping the page and mlocks
- * the page if any are found. The page will be returned with PG_mlocked cleared
- * if it is not mapped by any locked vmas.
- */
-void page_mlock(struct page *page)
-{
-	struct rmap_walk_control rwc = {
-		.rmap_one = page_mlock_one,
-		.done = page_not_mapped,
-		.anon_lock = page_lock_anon_vma_read,
-
-	};
-
-	VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
-
-	/* Anon THP are only marked as mlocked when singly mapped */
-	if (PageTransCompound(page) && PageAnon(page))
-		return;
-
-	rmap_walk(page, &rwc);
-}
-
 #ifdef CONFIG_DEVICE_PRIVATE
 struct make_exclusive_args {
 	struct mm_struct *mm;
@@ -2291,11 +2221,6 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the anon_vma struct it points to.
- *
- * When called from page_mlock(), the mmap_lock of the mm containing the vma
- * where the page was found will be held for write.  So, we won't recheck
- * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
- * LOCKED.
  */
 static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		bool locked)
@@ -2344,11 +2269,6 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
- *
- * When called from page_mlock(), the mmap_lock of the mm containing the vma
- * where the page was found will be held for write.  So, we won't recheck
- * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
- * LOCKED.
  */
 static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 		bool locked)
-- 
2.34.1



* [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
  2022-02-11 10:29   ` Vlastimil Babka
@ 2022-02-14  7:05     ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  7:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, Yang Li, linux-kernel, linux-mm

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points.  Keep it simple now: just count the
pmds and ignore the ptes; there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, since mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but I haven't quite convinced myself to change it).
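
In short, the gate added in mm/internal.h comes down to this (condensed
from the hunk below; munlock_vma_page() is its mirror image, and both
are called at the end of the rmap add/remove paths):

	void mlock_page(struct page *page);
	static inline void mlock_vma_page(struct page *page,
				struct vm_area_struct *vma, bool compound)
	{
		/* only count pages of VM_LOCKED vmas; filter out pte maps of THPs */
		if (unlikely(vma->vm_flags & VM_LOCKED) &&
		    (compound || !PageTransCompound(page)))
			mlock_page(page);
	}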

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
v2: Fix kerneldoc on mlock_page() to use colon, per Yang Li.
    mmotm will require minor fixup in mm/huge_memory.c like v1.

 include/linux/rmap.h    | 17 +++++++------
 kernel/events/uprobes.c |  7 ++----
 mm/huge_memory.c        | 17 ++++++-------
 mm/hugetlb.c            |  4 +--
 mm/internal.h           | 36 ++++++++++++++++++++++----
 mm/khugepaged.c         |  4 +--
 mm/ksm.c                | 12 +--------
 mm/memory.c             | 45 +++++++++++----------------------
 mm/migrate.c            |  9 ++-----
 mm/mlock.c              | 21 ++++++----------
 mm/rmap.c               | 56 +++++++++++++++++++----------------------
 mm/userfaultfd.c        | 14 ++++++-----
 12 files changed, 113 insertions(+), 129 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index dc48aa8c2c94..ac29b076082b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page);
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long address, bool compound);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-			   unsigned long, int);
+		unsigned long address, int flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
-void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
-
+		unsigned long address, bool compound);
+void page_add_file_rmap(struct page *, struct vm_area_struct *,
+		bool compound);
+void page_remove_rmap(struct page *, struct vm_area_struct *,
+		bool compound);
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
-			    unsigned long);
+		unsigned long address);
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-				unsigned long);
+		unsigned long address);
 
 static inline void page_dup_rmap(struct page *page, bool compound)
 {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..eed2f7437d96 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 			return err;
 	}
 
-	/* For try_to_free_swap() and munlock_vma_page() below */
+	/* For try_to_free_swap() below */
 	lock_page(old_page);
 
 	mmu_notifier_invalidate_range_start(&range);
@@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
 
-	page_remove_rmap(old_page, false);
+	page_remove_rmap(old_page, vma, false);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
-
-	if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page))
-		munlock_vma_page(old_page);
 	put_page(old_page);
 
 	err = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a34b85ebcf8..d6477f48a27e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, vma, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else if (thp_migration_supported()) {
@@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				set_page_dirty(page);
 			if (!PageReferenced(page) && pmd_young(old_pmd))
 				SetPageReferenced(page);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, vma, true);
 			put_page(page);
 		}
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
@@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 		}
 		unlock_page_memcg(page);
+
+		/* Above is effectively page_remove_rmap(page, vma, true) */
+		munlock_vma_page(page, vma, true);
 	}
 
 	smp_wmb(); /* make pte visible before pmd */
@@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (freeze) {
 		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, false);
+			page_remove_rmap(page + i, vma, false);
 			put_page(page + i);
 		}
 	}
@@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 				do_unlock_page = true;
 			}
 		}
-		if (PageMlocked(page))
-			clear_page_mlock(page);
 	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
 		goto out;
 	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
@@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, vma, true);
 	put_page(page);
 }
 
@@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	if (PageAnon(new))
 		page_add_anon_rmap(new, vma, mmun_start, true);
 	else
-		page_add_file_rmap(new, true);
+		page_add_file_rmap(new, vma, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
-	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
-		mlock_vma_page(new);
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61895cc01d09..43fb3155298e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			set_page_dirty(page);
 
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, vma, true);
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Break COW */
 		huge_ptep_clear_flush(vma, haddr, ptep);
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
-		page_remove_rmap(old_page, true);
+		page_remove_rmap(old_page, vma, true);
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, new_page, 1));
diff --git a/mm/internal.h b/mm/internal.h
index f235aa92e564..3d7dfc8bc471 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
 				   bool write, int *locked);
 extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
 			      unsigned long len);
-
 /*
- * must be called with vma's mmap_lock held for read or write, and page locked.
+ * mlock_vma_page() and munlock_vma_page():
+ * should be called with vma's mmap_lock held for read or write,
+ * under page table lock for the pte/pmd being added or removed.
+ *
+ * mlock is usually called at the end of page_add_*_rmap(),
+ * munlock at the end of page_remove_rmap(); but new anon
+ * pages are managed in lru_cache_add_inactive_or_unevictable().
+ *
+ * @compound is used to include pmd mappings of THPs, but filter out
+ * pte mappings of THPs, which cannot be consistently counted: a pte
+ * mapping of the THP head cannot be distinguished by the page alone.
  */
-extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+void mlock_page(struct page *page);
+static inline void mlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound)
+{
+	if (unlikely(vma->vm_flags & VM_LOCKED) &&
+	    (compound || !PageTransCompound(page)))
+		mlock_page(page);
+}
+void munlock_page(struct page *page);
+static inline void munlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound)
+{
+	if (unlikely(vma->vm_flags & VM_LOCKED) &&
+	    (compound || !PageTransCompound(page)))
+		munlock_page(page);
+}
 
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
@@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 #else /* !CONFIG_MMU */
 static inline void unmap_mapping_folio(struct folio *folio) { }
 static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound) { }
+static inline void munlock_vma_page(struct page *page,
+			struct vm_area_struct *vma, bool compound) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..d5e387c58bde 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 */
 			spin_lock(ptl);
 			ptep_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page, false);
+			page_remove_rmap(src_page, vma, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (pte_none(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
-		page_remove_rmap(page, false);
+		page_remove_rmap(page, vma, false);
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..c5a4403b5dc9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
-	page_remove_rmap(page, false);
+	page_remove_rmap(page, vma, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
@@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 			err = replace_page(vma, page, kpage, orig_pte);
 	}
 
-	if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
-		munlock_vma_page(page);
-		if (!PageMlocked(kpage)) {
-			unlock_page(page);
-			lock_page(kpage);
-			mlock_vma_page(kpage);
-			page = kpage;		/* for final unlock */
-		}
-	}
-
 out_unlock:
 	unlock_page(page);
 out:
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..53bd9e5f2e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 
 	set_pte_at(vma->vm_mm, address, ptep, pte);
 
-	if (vma->vm_flags & VM_LOCKED)
-		mlock_vma_page(page);
-
 	/*
 	 * No need to invalidate - it was non-present before. However
 	 * secondary CPUs may have mappings that need invalidating.
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-
 			if (is_device_private_entry(entry))
-				page_remove_rmap(page, false);
-
+				page_remove_rmap(page, vma, false);
 			put_page(page);
 			continue;
 		}
@@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page)
 	return 0;
 }
 
-static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 			unsigned long addr, struct page *page, pgprot_t prot)
 {
 	if (!pte_none(*pte))
 		return -EBUSY;
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
-	inc_mm_counter_fast(mm, mm_counter_file(page));
-	page_add_file_rmap(page, false);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+	page_add_file_rmap(page, vma, false);
+	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
 	return 0;
 }
 
@@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 			struct page *page, pgprot_t prot)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	int retval;
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	if (retval)
 		goto out;
 	retval = -ENOMEM;
-	pte = get_locked_pte(mm, addr, &ptl);
+	pte = get_locked_pte(vma->vm_mm, addr, &ptl);
 	if (!pte)
 		goto out;
-	retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
+	retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
 	pte_unmap_unlock(pte, ptl);
 out:
 	return retval;
 }
 
 #ifdef pte_index
-static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
 			unsigned long addr, struct page *page, pgprot_t prot)
 {
 	int err;
@@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
 	err = validate_page_before_insert(page);
 	if (err)
 		return err;
-	return insert_page_into_pte_locked(mm, pte, addr, page, prot);
+	return insert_page_into_pte_locked(vma, pte, addr, page, prot);
 }
 
 /* insert_pages() amortizes the cost of spinlock operations
@@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 
 		start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
 		for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
-			int err = insert_page_in_batch_locked(mm, pte,
+			int err = insert_page_in_batch_locked(vma, pte,
 				addr, pages[curr_page_idx], prot);
 			if (unlikely(err)) {
 				pte_unmap_unlock(start_pte, pte_lock);
@@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page, false);
+			page_remove_rmap(old_page, vma, false);
 		}
 
 		/* Free the old page.. */
@@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	 */
 	mmu_notifier_invalidate_range_only_end(&range);
 	if (old_page) {
-		/*
-		 * Don't let another task, with possibly unlocked vma,
-		 * keep the mlocked page.
-		 */
-		if (page_copied && (vma->vm_flags & VM_LOCKED)) {
-			lock_page(old_page);	/* LRU manipulation */
-			if (PageMlocked(old_page))
-				munlock_vma_page(old_page);
-			unlock_page(old_page);
-		}
 		if (page_copied)
 			free_swap_cache(old_page);
 		put_page(old_page);
@@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
 	add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
-	page_add_file_rmap(page, true);
+	page_add_file_rmap(page, vma, true);
+
 	/*
 	 * deposit and withdraw with pmd lock held
 	 */
@@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page, false);
+		page_add_file_rmap(page, vma, false);
 	}
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..7c4223ce2500 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			if (PageAnon(new))
 				page_add_anon_rmap(new, vma, pvmw.address, false);
 			else
-				page_add_file_rmap(new, false);
+				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
-		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
-			mlock_vma_page(new);
-
-		if (PageTransHuge(page) && PageMlocked(page))
-			clear_page_mlock(page);
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * drop page refcount. Page won't be freed, as we took
 			 * a reference just above.
 			 */
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, vma, false);
 			put_page(page);
 
 			if (pte_present(pte))
diff --git a/mm/mlock.c b/mm/mlock.c
index 5d7ced8303be..92f28258b4ae 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page)
 	}
 }
 
-/*
- * Mark page as mlocked if not already.
- * If page on LRU, isolate and putback to move to unevictable list.
+/**
+ * mlock_page - mlock a page
+ * @page: page to be mlocked, either a normal page or a THP head.
  */
-void mlock_vma_page(struct page *page)
+void mlock_page(struct page *page)
 {
-	/* Serialize with page migration */
-	BUG_ON(!PageLocked(page));
-
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
 
 	if (!TestSetPageMlocked(page)) {
 		int nr_pages = thp_nr_pages(page);
@@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page)
 }
 
 /**
- * munlock_vma_page - munlock a vma page
- * @page: page to be unlocked, either a normal page or THP page head
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
  */
-void munlock_vma_page(struct page *page)
+void munlock_page(struct page *page)
 {
-	/* Serialize with page migration */
-	BUG_ON(!PageLocked(page));
-
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 7ce7f1946cff..6cc8bf129f18 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page,
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
-	if (unlikely(PageKsm(page))) {
+	if (unlikely(PageKsm(page)))
 		unlock_page_memcg(page);
-		return;
-	}
 
 	/* address might be in next vma when migration races vma_adjust */
-	if (first)
+	else if (first)
 		__page_set_anon_rmap(page, vma, address,
 				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
+
+	mlock_vma_page(page, vma, compound);
 }
 
 /**
@@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page,
 
 /**
  * page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
- * @compound: charge the page as compound or small page
+ * @page:	the page to add the mapping to
+ * @vma:	the vm area in which the mapping is added
+ * @compound:	charge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page, bool compound)
+void page_add_file_rmap(struct page *page,
+	struct vm_area_struct *vma, bool compound)
 {
 	int i, nr = 1;
 
@@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound)
 						nr_pages);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
-			struct page *head = compound_head(page);
-
 			VM_WARN_ON_ONCE(!PageLocked(page));
-
-			SetPageDoubleMap(head);
-			if (PageMlocked(page))
-				clear_page_mlock(head);
+			SetPageDoubleMap(compound_head(page));
 		}
 		if (!atomic_inc_and_test(&page->_mapcount))
 			goto out;
@@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound)
 	__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 out:
 	unlock_page_memcg(page);
+
+	mlock_vma_page(page, vma, compound);
 }
 
 static void page_remove_file_rmap(struct page *page, bool compound)
@@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
+ * @vma:	the vm area from which the mapping is removed
  * @compound:	uncharge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page,
+	struct vm_area_struct *vma, bool compound)
 {
 	lock_page_memcg(page);
 
@@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound)
 	 */
 out:
 	unlock_page_memcg(page);
+
+	munlock_vma_page(page, vma, compound);
 }
 
 /*
@@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
 		/*
-		 * If the page is mlock()d, we cannot swap it out.
+		 * If the page is in an mlock()d vma, we must not swap it out.
 		 */
 		if (!(flags & TTU_IGNORE_MLOCK) &&
 		    (vma->vm_flags & VM_LOCKED)) {
-			/*
-			 * PTE-mapped THP are never marked as mlocked: so do
-			 * not set it on a DoubleMap THP, nor on an Anon THP
-			 * (which may still be PTE-mapped after DoubleMap was
-			 * cleared).  But stop unmapping even in those cases.
-			 */
-			if (!PageTransCompound(page) || (PageHead(page) &&
-			     !PageDoubleMap(page) && !PageAnon(page)))
-				mlock_vma_page(page);
+			/* Restore the mlock which got missed */
+			mlock_vma_page(page, vma, false);
 			page_vma_mapped_walk_done(&pvmw);
 			ret = false;
 			break;
 		}
 
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_PAGE(!pvmw.pte, page);
-
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
@@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, vma, PageHuge(page));
 		put_page(page);
 	}
 
@@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, vma, PageHuge(page));
 		put_page(page);
 	}
 
@@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page,
 		 * There is a reference on the page for the swap entry which has
 		 * been removed, so shouldn't take another.
 		 */
-		page_remove_rmap(subpage, false);
+		page_remove_rmap(subpage, vma, false);
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..15d3e97a6e04 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
 
-	if (page_in_cache)
-		page_add_file_rmap(page, false);
-	else
+	if (page_in_cache) {
+		/* Usually, cache pages are already added to LRU */
+		if (newly_allocated)
+			lru_cache_add(page);
+		page_add_file_rmap(page, dst_vma, false);
+	} else {
 		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+		lru_cache_add_inactive_or_unevictable(page, dst_vma);
+	}
 
 	/*
 	 * Must happen after rmap, as mm_counter() checks mapping (via
@@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	 */
 	inc_mm_counter(dst_mm, mm_counter(page));
 
-	if (newly_allocated)
-		lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	/* No need to invalidate - it was non-present before */
-- 
2.34.1



* [PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking
  2022-02-14  6:32     ` Hugh Dickins
@ 2022-02-14  7:09       ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, Hillf Danton, linux-kernel, linux-mm

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range().  This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() will mlock it a second time.  This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely even though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.
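
Condensed from the mm/mlock.c hunk below, the sequence in
mlock_vma_pages_range() is roughly:

	if (newflags & VM_LOCKED)
		newflags |= VM_IO;	/* racing mlock_vma_page() declines VM_IO */
	WRITE_ONCE(vma->vm_flags, newflags);

	lru_add_drain();
	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
	lru_add_drain();

	if (newflags & VM_IO) {
		newflags &= ~VM_IO;	/* walk done: show VM_LOCKED alone again */
		WRITE_ONCE(vma->vm_flags, newflags);
	}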

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
v2: Delete reintroduced VM_LOCKED_CLEAR_MASKing here, per Vlastimil.
    Use oldflags instead of vma->vm_flags in mlock_fixup(), per Vlastimil.
    Make clearer comment on mmap_lock and WRITE_ONCE, per Hillf.

 mm/internal.h |   3 +-
 mm/mlock.c    | 111 ++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 91 insertions(+), 23 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a43d79335c16..b3f0dd3ffba2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,7 +412,8 @@ void mlock_page(struct page *page);
 static inline void mlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound)
 {
-	if (unlikely(vma->vm_flags & VM_LOCKED) &&
+	/* VM_IO check prevents migration from double-counting during mlock */
+	if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
 	    (compound || !PageTransCompound(page)))
 		mlock_page(page);
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index f8a3a54687dd..581ea8bf1b83 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -14,6 +14,7 @@
 #include <linux/swapops.h>
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
+#include <linux/pagewalk.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
@@ -127,25 +128,91 @@ void munlock_page(struct page *page)
 	unlock_page_memcg(page);
 }
 
+static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
+			   unsigned long end, struct mm_walk *walk)
+
+{
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+	pte_t *start_pte, *pte;
+	struct page *page;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (!pmd_present(*pmd))
+			goto out;
+		if (is_huge_zero_pmd(*pmd))
+			goto out;
+		page = pmd_page(*pmd);
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page(page);
+		else
+			munlock_page(page);
+		goto out;
+	}
+
+	start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte))
+			continue;
+		page = vm_normal_page(vma, addr, *pte);
+		if (!page)
+			continue;
+		if (PageTransCompound(page))
+			continue;
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page(page);
+		else
+			munlock_page(page);
+	}
+	pte_unmap(start_pte);
+out:
+	spin_unlock(ptl);
+	cond_resched();
+	return 0;
+}
+
 /*
- * munlock_vma_pages_range() - munlock all pages in the vma range.'
- * @vma - vma containing range to be munlock()ed.
+ * mlock_vma_pages_range() - mlock any pages already in the range,
+ *                           or munlock all pages in the range.
+ * @vma - vma containing range to be mlock()ed or munlock()ed
  * @start - start address in @vma of the range
- * @end - end of range in @vma.
- *
- *  For mremap(), munmap() and exit().
+ * @end - end of range in @vma
+ * @newflags - the new set of flags for @vma.
  *
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared.  Callers must be prepared to
- * deal with this.
+ * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
+ * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
  */
-static void munlock_vma_pages_range(struct vm_area_struct *vma,
-				    unsigned long start, unsigned long end)
+static void mlock_vma_pages_range(struct vm_area_struct *vma,
+	unsigned long start, unsigned long end, vm_flags_t newflags)
 {
-	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	static const struct mm_walk_ops mlock_walk_ops = {
+		.pmd_entry = mlock_pte_range,
+	};
 
-	/* Reimplementation to follow in later commit */
+	/*
+	 * There is a slight chance that concurrent page migration,
+	 * or page reclaim finding a page of this now-VM_LOCKED vma,
+	 * will call mlock_vma_page() and raise page's mlock_count:
+	 * double counting, leaving the page unevictable indefinitely.
+	 * Communicate this danger to mlock_vma_page() with VM_IO,
+	 * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
+	 * mmap_lock is held in write mode here, so this weird
+	 * combination should not be visible to other mmap_lock users;
+	 * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED.
+	 */
+	if (newflags & VM_LOCKED)
+		newflags |= VM_IO;
+	WRITE_ONCE(vma->vm_flags, newflags);
+
+	lru_add_drain();
+	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+	lru_add_drain();
+
+	if (newflags & VM_IO) {
+		newflags &= ~VM_IO;
+		WRITE_ONCE(vma->vm_flags, newflags);
+	}
 }
 
 /*
@@ -164,10 +231,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	pgoff_t pgoff;
 	int nr_pages;
 	int ret = 0;
-	int lock = !!(newflags & VM_LOCKED);
-	vm_flags_t old_flags = vma->vm_flags;
+	vm_flags_t oldflags = vma->vm_flags;
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
+	if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
 	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
@@ -199,9 +265,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * Keep track of amount of locked VM.
 	 */
 	nr_pages = (end - start) >> PAGE_SHIFT;
-	if (!lock)
+	if (!(newflags & VM_LOCKED))
 		nr_pages = -nr_pages;
-	else if (old_flags & VM_LOCKED)
+	else if (oldflags & VM_LOCKED)
 		nr_pages = 0;
 	mm->locked_vm += nr_pages;
 
@@ -211,11 +277,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
 
-	if (lock)
+	if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
+		/* No work to do, and mlocking twice would be wrong */
 		vma->vm_flags = newflags;
-	else
-		munlock_vma_pages_range(vma, start, end);
-
+	} else {
+		mlock_vma_pages_range(vma, start, end, newflags);
+	}
 out:
 	*prev = vma;
 	return ret;
-- 
2.34.1



* [PATCH v2 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec
  2022-02-11 18:26   ` Vlastimil Babka
@ 2022-02-14  7:15     ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, Yang Li, SeongJae Park,
	Geert Uytterhoeven, linux-kernel, linux-mm

A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock.  That is not an overhead we would allow on
normal pages, but I think it acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on a per-cpu pagevec
before acquiring the lruvec lock.

I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is awkward and subject to races:
0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
this commit, which just widens the window?  I haven't gone back to think
it through.  Maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.

Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_pages() calls already contain a VM_BUG_ON_PGFLAGS()
for that.

This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since the pages sit on the pagevec waiting to be processed long after the page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).
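
For orientation, the batching below amounts to a per-cpu pagevec whose
page pointers carry a tag in their low bits (condensed from the
mm/mlock.c hunk, nothing beyond it):

	static DEFINE_PER_CPU(struct pagevec, mlock_pvec);

	#define LRU_PAGE 0x1	/* page was on LRU: handled by __mlock_page() */
	#define NEW_PAGE 0x2	/* not yet on LRU: handled by __mlock_new_page() */
	#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
	#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))

	/* an untagged pointer on the pagevec means munlock: __munlock_page() */

mlock_pagevec() strips the tag from each entry and dispatches to
__mlock_page(), __mlock_new_page() or __munlock_page(), reusing one
lruvec lock across the batch before releasing the gathered references.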

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
v2: Fix kerneldoc on mlock_page() and two more functions, per Yang Li.
    Fix build without CONFIG_MMU, per SeongJae, Geert, Naresh, lkp.

 mm/internal.h |   9 ++-
 mm/mlock.c    | 212 ++++++++++++++++++++++++++++++++++++++++++--------
 mm/swap.c     |  27 ++++---
 3 files changed, 201 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3f0dd3ffba2..18af980bb1b8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
  *
  * mlock is usually called at the end of page_add_*_rmap(),
  * munlock at the end of page_remove_rmap(); but new anon
- * pages are managed in lru_cache_add_inactive_or_unevictable().
+ * pages are managed by lru_cache_add_inactive_or_unevictable()
+ * calling mlock_new_page().
  *
  * @compound is used to include pmd mappings of THPs, but filter out
  * pte mappings of THPs, which cannot be consistently counted: a pte
@@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page,
 	    (compound || !PageTransCompound(page)))
 		munlock_page(page);
 }
+void mlock_new_page(struct page *page);
+bool need_mlock_page_drain(int cpu);
+void mlock_page_drain(int cpu);
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
@@ -503,6 +507,9 @@ static inline void mlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
 static inline void munlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
+static inline void mlock_new_page(struct page *page) { }
+static inline bool need_mlock_page_drain(int cpu) { return false; }
+static inline void mlock_page_drain(int cpu) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index 581ea8bf1b83..93d616ba3e22 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -28,6 +28,8 @@
 
 #include "internal.h"
 
+static DEFINE_PER_CPU(struct pagevec, mlock_pvec);
+
 bool can_do_mlock(void)
 {
 	if (rlimit(RLIMIT_MEMLOCK) != 0)
@@ -49,57 +51,79 @@ EXPORT_SYMBOL(can_do_mlock);
  * PageUnevictable is set to indicate the unevictable state.
  */
 
-/**
- * mlock_page - mlock a page
- * @page: page to be mlocked, either a normal page or a THP head.
- */
-void mlock_page(struct page *page)
+static struct lruvec *__mlock_page(struct page *page, struct lruvec *lruvec)
 {
-	struct lruvec *lruvec;
-	int nr_pages = thp_nr_pages(page);
+	/* There is nothing more we can do while it's off LRU */
+	if (!TestClearPageLRU(page))
+		return lruvec;
 
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
 
-	if (!TestSetPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	if (unlikely(page_evictable(page))) {
+		/*
+		 * This is a little surprising, but quite possible:
+		 * PageMlocked must have got cleared already by another CPU.
+		 * Could this page be on the Unevictable LRU?  I'm not sure,
+		 * but move it now if so.
+		 */
+		if (PageUnevictable(page)) {
+			del_page_from_lru_list(page, lruvec);
+			ClearPageUnevictable(page);
+			add_page_to_lru_list(page, lruvec);
+			__count_vm_events(UNEVICTABLE_PGRESCUED,
+					  thp_nr_pages(page));
+		}
+		goto out;
 	}
 
-	/* There is nothing more we can do while it's off LRU */
-	if (!TestClearPageLRU(page))
-		return;
-
-	lruvec = folio_lruvec_lock_irq(page_folio(page));
 	if (PageUnevictable(page)) {
-		page->mlock_count++;
+		if (PageMlocked(page))
+			page->mlock_count++;
 		goto out;
 	}
 
 	del_page_from_lru_list(page, lruvec);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
-	page->mlock_count = 1;
+	page->mlock_count = !!PageMlocked(page);
 	add_page_to_lru_list(page, lruvec);
-	__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
+	__count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
 out:
 	SetPageLRU(page);
-	unlock_page_lruvec_irq(lruvec);
+	return lruvec;
 }
 
-/**
- * munlock_page - munlock a page
- * @page: page to be munlocked, either a normal page or a THP head.
- */
-void munlock_page(struct page *page)
+static struct lruvec *__mlock_new_page(struct page *page, struct lruvec *lruvec)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
+
+	/* As above, this is a little surprising, but possible */
+	if (unlikely(page_evictable(page)))
+		goto out;
+
+	SetPageUnevictable(page);
+	page->mlock_count = !!PageMlocked(page);
+	__count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
+out:
+	add_page_to_lru_list(page, lruvec);
+	SetPageLRU(page);
+	return lruvec;
+}
+
+static struct lruvec *__munlock_page(struct page *page, struct lruvec *lruvec)
 {
-	struct lruvec *lruvec;
 	int nr_pages = thp_nr_pages(page);
+	bool isolated = false;
 
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	if (!TestClearPageLRU(page))
+		goto munlock;
 
-	lock_page_memcg(page);
-	lruvec = folio_lruvec_lock_irq(page_folio(page));
-	if (PageLRU(page) && PageUnevictable(page)) {
+	isolated = true;
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
+
+	if (PageUnevictable(page)) {
 		/* Then mlock_count is maintained, but might undercount */
 		if (page->mlock_count)
 			page->mlock_count--;
@@ -108,24 +132,144 @@ void munlock_page(struct page *page)
 	}
 	/* else assume that was the last mlock: reclaim will fix it if not */
 
+munlock:
 	if (TestClearPageMlocked(page)) {
 		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		if (PageLRU(page) || !PageUnevictable(page))
+		if (isolated || !PageUnevictable(page))
 			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
 		else
 			__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 	}
 
 	/* page_evictable() has to be checked *after* clearing Mlocked */
-	if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
+	if (isolated && PageUnevictable(page) && page_evictable(page)) {
 		del_page_from_lru_list(page, lruvec);
 		ClearPageUnevictable(page);
 		add_page_to_lru_list(page, lruvec);
 		__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
 	}
 out:
-	unlock_page_lruvec_irq(lruvec);
-	unlock_page_memcg(page);
+	if (isolated)
+		SetPageLRU(page);
+	return lruvec;
+}
+
+/*
+ * Flags held in the low bits of a struct page pointer on the mlock_pvec.
+ */
+#define LRU_PAGE 0x1
+#define NEW_PAGE 0x2
+#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
+#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))
+
+/*
+ * mlock_pagevec() is derived from pagevec_lru_move_fn():
+ * perhaps that can make use of such page pointer flags in future,
+ * but for now just keep it for mlock.  We could use three separate
+ * pagevecs instead, but one feels better (munlocking a full pagevec
+ * does not need to drain mlocking pagevecs first).
+ */
+static void mlock_pagevec(struct pagevec *pvec)
+{
+	struct lruvec *lruvec = NULL;
+	unsigned long mlock;
+	struct page *page;
+	int i;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		mlock = (unsigned long)page & (LRU_PAGE | NEW_PAGE);
+		page = (struct page *)((unsigned long)page - mlock);
+		pvec->pages[i] = page;
+
+		if (mlock & LRU_PAGE)
+			lruvec = __mlock_page(page, lruvec);
+		else if (mlock & NEW_PAGE)
+			lruvec = __mlock_new_page(page, lruvec);
+		else
+			lruvec = __munlock_page(page, lruvec);
+	}
+
+	if (lruvec)
+		unlock_page_lruvec_irq(lruvec);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
+}
+
+void mlock_page_drain(int cpu)
+{
+	struct pagevec *pvec;
+
+	pvec = &per_cpu(mlock_pvec, cpu);
+	if (pagevec_count(pvec))
+		mlock_pagevec(pvec);
+}
+
+bool need_mlock_page_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(mlock_pvec, cpu));
+}
+
+/**
+ * mlock_page - mlock a page already on (or temporarily off) LRU
+ * @page: page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+	if (!TestSetPageMlocked(page)) {
+		int nr_pages = thp_nr_pages(page);
+
+		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	}
+
+	get_page(page);
+	if (!pagevec_add(pvec, mlock_lru(page)) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
+}
+
+/**
+ * mlock_new_page - mlock a newly allocated page not yet on LRU
+ * @page: page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_new_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+	int nr_pages = thp_nr_pages(page);
+
+	SetPageMlocked(page);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+	__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+
+	get_page(page);
+	if (!pagevec_add(pvec, mlock_new(page)) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
+}
+
+/**
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
+ */
+void munlock_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+	/*
+	 * TestClearPageMlocked(page) must be left to __munlock_page(),
+	 * which will check whether the page is multiply mlocked.
+	 */
+
+	get_page(page);
+	if (!pagevec_add(pvec, page) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
 }
 
 static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
diff --git a/mm/swap.c b/mm/swap.c
index 3f770b1ea2c1..842d5cd92cf6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -490,18 +490,12 @@ EXPORT_SYMBOL(folio_add_lru);
 void lru_cache_add_inactive_or_unevictable(struct page *page,
 					 struct vm_area_struct *vma)
 {
-	bool unevictable;
-
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
-	if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
-		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
-	}
-	lru_cache_add(page);
+	if (unlikely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED))
+		mlock_new_page(page);
+	else
+		lru_cache_add(page);
 }
 
 /*
@@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu)
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
+	mlock_page_drain(cpu);
 }
 
 /**
@@ -842,6 +837,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
 		    need_activate_page_drain(cpu) ||
+		    need_mlock_page_drain(cpu) ||
 		    has_bh_in_lru(cpu, NULL)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
@@ -1030,7 +1026,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	 * Is an smp_mb__after_atomic() still required here, before
 	 * folio_evictable() tests PageMlocked, to rule out the possibility
 	 * of stranding an evictable folio on an unevictable LRU?  I think
-	 * not, because munlock_page() only clears PageMlocked while the LRU
+	 * not, because __munlock_page() only clears PageMlocked while the LRU
 	 * lock is held.
 	 *
 	 * (That is not true of __page_cache_release(), and not necessarily
@@ -1043,7 +1039,14 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	} else {
 		folio_clear_active(folio);
 		folio_set_unevictable(folio);
-		folio->mlock_count = !!folio_test_mlocked(folio);
+		/*
+		 * folio->mlock_count = !!folio_test_mlocked(folio)?
+		 * But that leaves __mlock_page() in doubt whether another
+		 * actor has already counted the mlock or not.  Err on the
+		 * safe side, underestimate, let page reclaim fix it, rather
+		 * than leaving a page on the unevictable LRU indefinitely.
+		 */
+		folio->mlock_count = 0;
 		if (!was_unevictable)
 			__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
 	}
-- 
2.34.1



* [PATCH v2 11/13] mm/munlock: page migration needs mlock pagevec drained
  2022-02-14  5:34     ` Hugh Dickins
@ 2022-02-14  7:17       ` Hugh Dickins
  0 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2022-02-14  7:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Alistair Popple,
	Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao,
	Greg Thelen, Shakeel Butt, linux-kernel, linux-mm

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So have try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.
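
In try_to_unmap_one() and try_to_migrate_one(), the addition reduces to
this (condensed from the hunks below; remove_migration_pte() gets the
same two-line drain after restoring the pte):

	page_remove_rmap(subpage, vma, PageHuge(page));
	if (vma->vm_flags & VM_LOCKED)
		mlock_page_drain(smp_processor_id());
	put_page(page);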

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
v2: Delete suggestion of sysctl from changelog, per Vlastimil.

 mm/migrate.c | 2 ++
 mm/rmap.c    | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 
@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 
-- 
2.34.1



* Re: [PATCH v2 01/13] mm/munlock: delete page_mlock() and all its works
  2022-02-14  6:59           ` [PATCH v2 " Hugh Dickins
@ 2022-02-14 10:07             ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2022-02-14 10:07 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Michal Hocko, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Alistair Popple, Johannes Weiner,
	Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen,
	Shakeel Butt, linux-kernel, linux-mm

On 2/14/22 07:59, Hugh Dickins wrote:
> We have recommended that some applications mlock their userspace, but that
> turns out to be counter-productive: when many processes mlock the same
> file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
> is needed for write, to remove any vma mapping that file from rmap's tree;
> but hogged for read by those with mlocks calling page_mlock() (formerly
> known as try_to_munlock()) on *each* page mapped from the file (the
> purpose being to find out whether another process has the page mlocked,
> so therefore it should not be unmlocked yet).
> 
> Several optimizations have been made in the past: one is to skip
> page_mlock() when mapcount tells that nothing else has this page
> mapped; but that doesn't help at all when others do have it mapped.
> This time around, I initially intended to add a preliminary search
> of the rmap tree for overlapping VM_LOCKED ranges; but that gets
> messy with locking order, when in doubt whether a page is actually
> present; and risks adding even more contention on the i_mmap_rwsem.
> 
> A solution would be much easier, if only there were space in struct page
> for an mlock_count... but actually, most of the time, there is space for
> it - an mlocked page spends most of its life on an unevictable LRU, but
> since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
> been redundant.  Let's try to reuse its page->lru.
> 
> But leave that until a later patch: in this patch, clear the ground by
> removing page_mlock(), and all the infrastructure that has gathered
> around it - which mostly hinders understanding, and will make reviewing
> new additions harder.  Don't mind those old comments about THPs, they
> date from before 4.5's refcounting rework: splitting is not a risk here.
> 
> Just keep a minimal version of munlock_vma_page(), as a reminder of what it
> should attend to (in particular, the odd way PGSTRANDED is counted out of
> PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
> unchanged __mlock_posix_error_return() out of the way, down to above its
> caller: this series then makes no further change after mlock_fixup().
> 
> After this and each following commit, the kernel builds, boots and runs;
> but with deficiencies which may show up in testing of mlock and munlock.
> The system calls succeed or fail as before, and mlock remains effective
> in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
> may be shown too low after mlock, grow, then stay too high after munlock:
> with previously mlocked pages remaining unevictable for too long, until
> finally unmapped and freed, and the counts corrected.  Normal service will be
> resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

Great!

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2022-02-14 10:45 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-06 21:27 [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Hugh Dickins
2022-02-06 21:30 ` [PATCH 01/13] mm/munlock: delete page_mlock() and all its works Hugh Dickins
2022-02-09 18:31   ` Vlastimil Babka
2022-02-09 22:28     ` Hugh Dickins
2022-02-10  9:52       ` Vlastimil Babka
2022-02-14  6:52         ` Hugh Dickins
2022-02-14  6:59           ` [PATCH v2 " Hugh Dickins
2022-02-14 10:07             ` Vlastimil Babka
2022-02-06 21:32 ` [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE Hugh Dickins
2022-02-10 11:35   ` Vlastimil Babka
2022-02-06 21:34 ` [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap Hugh Dickins
2022-02-10 15:30   ` Vlastimil Babka
2022-02-06 21:36 ` [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page() Hugh Dickins
2022-02-11 10:29   ` Vlastimil Babka
2022-02-14  7:05     ` [PATCH v2 " Hugh Dickins
2022-02-06 21:38 ` [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance Hugh Dickins
2022-02-11 11:42   ` Vlastimil Babka
2022-02-06 21:38 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Matthew Wilcox
2022-02-07 18:20   ` Hugh Dickins
2022-02-06 21:40 ` [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable Hugh Dickins
2022-02-11 12:27   ` Vlastimil Babka
2022-02-14  5:42     ` Hugh Dickins
2022-02-11 18:07   ` Vlastimil Babka
2022-02-14  6:28     ` Hugh Dickins
2022-02-06 21:42 ` [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking Hugh Dickins
2022-02-11 16:45   ` Vlastimil Babka
2022-02-14  6:32     ` Hugh Dickins
2022-02-14  7:09       ` [PATCH v2 " Hugh Dickins
2022-02-06 21:43 ` [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU Hugh Dickins
2022-02-11 17:19   ` Vlastimil Babka
2022-02-06 21:45 ` [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn() Hugh Dickins
2022-02-11 17:43   ` Vlastimil Babka
2022-02-06 21:47 ` [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec Hugh Dickins
2022-02-09  8:15   ` Geert Uytterhoeven
2022-02-09 15:45     ` Hugh Dickins
2022-02-11 18:26   ` Vlastimil Babka
2022-02-14  7:15     ` [PATCH v2 " Hugh Dickins
2022-02-06 21:49 ` [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained Hugh Dickins
2022-02-11 18:49   ` Vlastimil Babka
2022-02-14  5:34     ` Hugh Dickins
2022-02-14  7:17       ` [PATCH v2 " Hugh Dickins
2022-02-06 21:51 ` [PATCH 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH) Hugh Dickins
2022-02-06 21:53 ` [PATCH 13/13] mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP Hugh Dickins
     [not found] ` <20220207033518.2602-1-hdanton@sina.com>
2022-02-07 18:46   ` [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking Hugh Dickins
2022-02-09 15:35 ` [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling Michal Hocko
2022-02-09 16:21   ` Hugh Dickins
2022-02-09 21:01     ` Michal Hocko
2022-02-09 22:59       ` Hugh Dickins
2022-02-10  7:49         ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).