* [PATCH v3 00/18] Rearrange batched folio freeing
@ 2024-02-27 17:42 Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 01/18] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
                   ` (17 more replies)
  0 siblings, 18 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

Other than the obvious "remove calls to compound_head" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing).  There's also
an associated belief that, since we iterate the batch of folios three
times, we do better when the array is small (i.e. 15 entries) than with
a batch that is hundreds of entries long, which only gives the first
pages time to fall out of cache before we reach the end.

It is possible we should increase the size of folio_batch.  Hopefully the
bots let us know if this introduces any performance regressions.
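
For anyone unfamiliar with the idiom, the conversion pattern used
throughout the series looks like this.  This is a condensed editorial
sketch of the code added in patches 1, 3 and 7, not a separate API;
the list walk is just a stand-in for whatever iteration the caller
already does:

	struct folio_batch fbatch;
	struct folio *folio;

	folio_batch_init(&fbatch);
	list_for_each_entry(folio, pages, lru) {
		/* folio_batch_add() returns the number of slots left */
		if (folio_batch_add(&fbatch, folio) > 0)
			continue;
		/* batch is full (PAGEVEC_SIZE, i.e. 15 folios): flush it */
		free_unref_folios(&fbatch);
	}
	if (fbatch.nr)
		free_unref_folios(&fbatch);

free_unref_folios() (like folios_put()) hands the batch back empty, so
the same fbatch can be refilled without reinitialising it.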

v3:
 - Rebased on next-20240227
 - Add folios_put_refs() to support unmapping large PTE-mapped folios
 - Used folio_batch_reinit() instead of assigning 0 to fbatch->nr.  This
   makes sure the iterator is correctly reset.
 
v2:
 - Redo the shrink_folio_list() patch to free the mapped folios at
   the end instead of calling try_to_unmap_flush() more often.
 - Improve a number of commit messages
 - Use pcp_allowed_order() instead of PAGE_ALLOC_COSTLY_ORDER (Ryan)
 - Fix move_folios_to_lru() comment (Ryan)
 - Add patches 15-18
 - Collect R-b tags from Ryan

Matthew Wilcox (Oracle) (18):
  mm: Make folios_put() the basis of release_pages()
  mm: Convert free_unref_page_list() to use folios
  mm: Add free_unref_folios()
  mm: Use folios_put() in __folio_batch_release()
  memcg: Add mem_cgroup_uncharge_folios()
  mm: Remove use of folio list from folios_put()
  mm: Use free_unref_folios() in put_pages_list()
  mm: use __page_cache_release() in folios_put()
  mm: Handle large folios in free_unref_folios()
  mm: Allow non-hugetlb large folios to be batch processed
  mm: Free folios in a batch in shrink_folio_list()
  mm: Free folios directly in move_folios_to_lru()
  memcg: Remove mem_cgroup_uncharge_list()
  mm: Remove free_unref_page_list()
  mm: Remove lru_to_page()
  mm: Convert free_pages_and_swap_cache() to use folios_put()
  mm: Use a folio in __collapse_huge_page_copy_succeeded()
  mm: Convert free_swap_cache() to take a folio

 include/linux/memcontrol.h |  26 +++--
 include/linux/mm.h         |  17 ++--
 include/linux/swap.h       |   8 +-
 mm/internal.h              |   4 +-
 mm/khugepaged.c            |  30 +++---
 mm/memcontrol.c            |  16 +--
 mm/memory.c                |   2 +-
 mm/mlock.c                 |   3 +-
 mm/page_alloc.c            |  76 +++++++-------
 mm/swap.c                  | 198 ++++++++++++++++++++-----------------
 mm/swap_state.c            |  33 ++++---
 mm/vmscan.c                |  52 ++++------
 12 files changed, 240 insertions(+), 225 deletions(-)

-- 
2.43.0




* [PATCH v3 01/18] mm: Make folios_put() the basis of release_pages()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 02/18] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

By making release_pages() call folios_put(), we can get rid of the calls
to compound_head() for the callers that already know they have folios.
We can also get rid of the lock_batch tracking as we know the size
of the batch is limited by folio_batch.  This does reduce the maximum
number of pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX
(32) to PAGEVEC_SIZE (15).  I do not expect this to make a significant
difference, but if it does, we can increase PAGEVEC_SIZE to 31.
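
As a condensed illustration of the API change (taken from the mlock.c
hunk below), a caller that already has a folio_batch goes from

	folios_put(fbatch->folios, folio_batch_count(fbatch));
	folio_batch_reinit(fbatch);

to simply

	folios_put(fbatch);

since folios_put() now takes the batch itself and hands it back empty.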

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h |  16 +++++---
 mm/mlock.c         |   3 +-
 mm/swap.c          | 100 ++++++++++++++++++++++++++-------------------
 3 files changed, 70 insertions(+), 49 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8c65171722b6..07d950e63c30 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,7 @@ struct anon_vma;
 struct anon_vma_chain;
 struct user_struct;
 struct pt_regs;
+struct folio_batch;
 
 extern int sysctl_page_lock_unfairness;
 
@@ -1533,6 +1534,8 @@ static inline void folio_put_refs(struct folio *folio, int refs)
 		__folio_put(folio);
 }
 
+void folios_put_refs(struct folio_batch *folios, unsigned int *refs);
+
 /*
  * union release_pages_arg - an array of pages or folios
  *
@@ -1555,18 +1558,19 @@ void release_pages(release_pages_arg, int nr);
 /**
  * folios_put - Decrement the reference count on an array of folios.
  * @folios: The folios.
- * @nr: How many folios there are.
  *
- * Like folio_put(), but for an array of folios.  This is more efficient
- * than writing the loop yourself as it will optimise the locks which
- * need to be taken if the folios are freed.
+ * Like folio_put(), but for a batch of folios.  This is more efficient
+ * than writing the loop yourself as it will optimise the locks which need
+ * to be taken if the folios are freed.  The folios batch is returned
+ * empty and ready to be reused for another batch; there is no need to
+ * reinitialise it.
  *
  * Context: May be called in process or interrupt context, but not in NMI
  * context.  May be called while holding a spinlock.
  */
-static inline void folios_put(struct folio **folios, unsigned int nr)
+static inline void folios_put(struct folio_batch *folios)
 {
-	release_pages(folios, nr);
+	folios_put_refs(folios, NULL);
 }
 
 static inline void put_page(struct page *page)
diff --git a/mm/mlock.c b/mm/mlock.c
index 086546ac5766..1ed2f2ab37cd 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -206,8 +206,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 
 	if (lruvec)
 		unlock_page_lruvec_irq(lruvec);
-	folios_put(fbatch->folios, folio_batch_count(fbatch));
-	folio_batch_reinit(fbatch);
+	folios_put(fbatch);
 }
 
 void mlock_drain_local(void)
diff --git a/mm/swap.c b/mm/swap.c
index e5380d732c0d..3d51f8c72017 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -89,7 +89,7 @@ static void __page_cache_release(struct folio *folio)
 		__folio_clear_lru_flags(folio);
 		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
-	/* See comment on folio_test_mlocked in release_pages() */
+	/* See comment on folio_test_mlocked in folios_put() */
 	if (unlikely(folio_test_mlocked(folio))) {
 		long nr_pages = folio_nr_pages(folio);
 
@@ -175,7 +175,7 @@ static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
 	 * while the LRU lock is held.
 	 *
 	 * (That is not true of __page_cache_release(), and not necessarily
-	 * true of release_pages(): but those only clear the mlocked flag after
+	 * true of folios_put(): but those only clear the mlocked flag after
 	 * folio_put_testzero() has excluded any other users of the folio.)
 	 */
 	if (folio_evictable(folio)) {
@@ -221,8 +221,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
-	folios_put(fbatch->folios, folio_batch_count(fbatch));
-	folio_batch_reinit(fbatch);
+	folios_put(fbatch);
 }
 
 static void folio_batch_add_and_move(struct folio_batch *fbatch,
@@ -946,47 +945,30 @@ void lru_cache_disable(void)
 }
 
 /**
- * release_pages - batched put_page()
- * @arg: array of pages to release
- * @nr: number of pages
+ * folios_put_refs - Reduce the reference count on a batch of folios.
+ * @folios: The folios.
+ * @refs: The number of refs to subtract from each folio.
  *
- * Decrement the reference count on all the pages in @arg.  If it
- * fell to zero, remove the page from the LRU and free it.
+ * Like folio_put(), but for a batch of folios.  This is more efficient
+ * than writing the loop yourself as it will optimise the locks which need
+ * to be taken if the folios are freed.  The folios batch is returned
+ * empty and ready to be reused for another batch; there is no need
+ * to reinitialise it.  If @refs is NULL, we subtract one from each
+ * folio refcount.
  *
- * Note that the argument can be an array of pages, encoded pages,
- * or folio pointers. We ignore any encoded bits, and turn any of
- * them into just a folio that gets free'd.
+ * Context: May be called in process or interrupt context, but not in NMI
+ * context.  May be called while holding a spinlock.
  */
-void release_pages(release_pages_arg arg, int nr)
+void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 {
 	int i;
-	struct encoded_page **encoded = arg.encoded_pages;
 	LIST_HEAD(pages_to_free);
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
-	unsigned int lock_batch;
 
-	for (i = 0; i < nr; i++) {
-		unsigned int nr_refs = 1;
-		struct folio *folio;
-
-		/* Turn any of the argument types into a folio */
-		folio = page_folio(encoded_page_ptr(encoded[i]));
-
-		/* Is our next entry actually "nr_pages" -> "nr_refs" ? */
-		if (unlikely(encoded_page_flags(encoded[i]) &
-			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
-			nr_refs = encoded_nr_pages(encoded[++i]);
-
-		/*
-		 * Make sure the IRQ-safe lock-holding time does not get
-		 * excessive with a continuous string of pages from the
-		 * same lruvec. The lock is held only if lruvec != NULL.
-		 */
-		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
-			unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = NULL;
-		}
+	for (i = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
+		unsigned int nr_refs = refs ? refs[i] : 1;
 
 		if (is_huge_zero_page(&folio->page))
 			continue;
@@ -1016,13 +998,8 @@ void release_pages(release_pages_arg arg, int nr)
 		}
 
 		if (folio_test_lru(folio)) {
-			struct lruvec *prev_lruvec = lruvec;
-
 			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
 									&flags);
-			if (prev_lruvec != lruvec)
-				lock_batch = 0;
-
 			lruvec_del_folio(lruvec, folio);
 			__folio_clear_lru_flags(folio);
 		}
@@ -1046,6 +1023,47 @@ void release_pages(release_pages_arg arg, int nr)
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
+	folio_batch_reinit(folios);
+}
+EXPORT_SYMBOL(folios_put_refs);
+
+/**
+ * release_pages - batched put_page()
+ * @arg: array of pages to release
+ * @nr: number of pages
+ *
+ * Decrement the reference count on all the pages in @arg.  If it
+ * fell to zero, remove the page from the LRU and free it.
+ *
+ * Note that the argument can be an array of pages, encoded pages,
+ * or folio pointers. We ignore any encoded bits, and turn any of
+ * them into just a folio that gets free'd.
+ */
+void release_pages(release_pages_arg arg, int nr)
+{
+	struct folio_batch fbatch;
+	int refs[PAGEVEC_SIZE];
+	struct encoded_page **encoded = arg.encoded_pages;
+	int i;
+
+	folio_batch_init(&fbatch);
+	for (i = 0; i < nr; i++) {
+		/* Turn any of the argument types into a folio */
+		struct folio *folio = page_folio(encoded_page_ptr(encoded[i]));
+
+		/* Is our next entry actually "nr_pages" -> "nr_refs" ? */
+		refs[fbatch.nr] = 1;
+		if (unlikely(encoded_page_flags(encoded[i]) &
+			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+			refs[fbatch.nr] = encoded_nr_pages(encoded[++i]);
+
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		folios_put_refs(&fbatch, refs);
+	}
+
+	if (fbatch.nr)
+		folios_put_refs(&fbatch, refs);
 }
 EXPORT_SYMBOL(release_pages);
 
-- 
2.43.0




* [PATCH v3 02/18] mm: Convert free_unref_page_list() to use folios
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 01/18] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 03/18] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

Most of its callees are not yet ready to accept a folio, but we know
all of the pages passed in are actually folios because they're linked
through ->lru.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/page_alloc.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 96839b210abe..24798531fe98 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2556,17 +2556,17 @@ void free_unref_page(struct page *page, unsigned int order)
 void free_unref_page_list(struct list_head *list)
 {
 	unsigned long __maybe_unused UP_flags;
-	struct page *page, *next;
+	struct folio *folio, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
 	int migratetype;
 
 	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
-		if (!free_unref_page_prepare(page, pfn, 0)) {
-			list_del(&page->lru);
+	list_for_each_entry_safe(folio, next, list, lru) {
+		unsigned long pfn = folio_pfn(folio);
+		if (!free_unref_page_prepare(&folio->page, pfn, 0)) {
+			list_del(&folio->lru);
 			continue;
 		}
 
@@ -2574,24 +2574,25 @@ void free_unref_page_list(struct list_head *list)
 		 * Free isolated pages directly to the allocator, see
 		 * comment in free_unref_page.
 		 */
-		migratetype = get_pcppage_migratetype(page);
+		migratetype = get_pcppage_migratetype(&folio->page);
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+			list_del(&folio->lru);
+			free_one_page(folio_zone(folio), &folio->page, pfn,
+					0, migratetype, FPI_NONE);
 			continue;
 		}
 	}
 
-	list_for_each_entry_safe(page, next, list, lru) {
-		struct zone *zone = page_zone(page);
+	list_for_each_entry_safe(folio, next, list, lru) {
+		struct zone *zone = folio_zone(folio);
 
-		list_del(&page->lru);
-		migratetype = get_pcppage_migratetype(page);
+		list_del(&folio->lru);
+		migratetype = get_pcppage_migratetype(&folio->page);
 
 		/*
 		 * Either different zone requiring a different pcp lock or
 		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * folios.
 		 */
 		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
 			if (pcp) {
@@ -2602,15 +2603,16 @@ void free_unref_page_list(struct list_head *list)
 			batch_count = 0;
 
 			/*
-			 * trylock is necessary as pages may be getting freed
+			 * trylock is necessary as folios may be getting freed
 			 * from IRQ or SoftIRQ context after an IO completion.
 			 */
 			pcp_trylock_prepare(UP_flags);
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, page_to_pfn(page),
-					      0, migratetype, FPI_NONE);
+				free_one_page(zone, &folio->page,
+						folio_pfn(folio), 0,
+						migratetype, FPI_NONE);
 				locked_zone = NULL;
 				continue;
 			}
@@ -2624,8 +2626,8 @@ void free_unref_page_list(struct list_head *list)
 		if (unlikely(migratetype >= MIGRATE_PCPTYPES))
 			migratetype = MIGRATE_MOVABLE;
 
-		trace_mm_page_free_batched(page);
-		free_unref_page_commit(zone, pcp, page, migratetype, 0);
+		trace_mm_page_free_batched(&folio->page);
+		free_unref_page_commit(zone, pcp, &folio->page, migratetype, 0);
 		batch_count++;
 	}
 
-- 
2.43.0




* [PATCH v3 03/18] mm: Add free_unref_folios()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 01/18] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 02/18] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 04/18] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

Iterate over a folio_batch rather than a linked list.  This is easier for
the CPU to prefetch and has a batch count naturally built in so we don't
need to track it.  Again, this lowers the maximum number of folios
freed under a single lock hold from 32 to 15, but I do not expect this
to have a significant effect.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h   |  5 +++--
 mm/page_alloc.c | 59 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b680a749cc37..3ca7e9d45b33 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -452,8 +452,9 @@ extern bool free_pages_prepare(struct page *page, unsigned int order);
 
 extern int user_min_free_kbytes;
 
-extern void free_unref_page(struct page *page, unsigned int order);
-extern void free_unref_page_list(struct list_head *list);
+void free_unref_page(struct page *page, unsigned int order);
+void free_unref_folios(struct folio_batch *fbatch);
+void free_unref_page_list(struct list_head *list);
 
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 24798531fe98..ff8759a69221 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -32,6 +32,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
+#include <linux/pagevec.h>
 #include <linux/memory_hotplug.h>
 #include <linux/nodemask.h>
 #include <linux/vmstat.h>
@@ -2551,57 +2552,51 @@ void free_unref_page(struct page *page, unsigned int order)
 }
 
 /*
- * Free a list of 0-order pages
+ * Free a batch of 0-order pages
  */
-void free_unref_page_list(struct list_head *list)
+void free_unref_folios(struct folio_batch *folios)
 {
 	unsigned long __maybe_unused UP_flags;
-	struct folio *folio, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
-	int batch_count = 0;
-	int migratetype;
+	int i, j, migratetype;
 
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(folio, next, list, lru) {
+	/* Prepare folios for freeing */
+	for (i = 0, j = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
 		unsigned long pfn = folio_pfn(folio);
-		if (!free_unref_page_prepare(&folio->page, pfn, 0)) {
-			list_del(&folio->lru);
+		if (!free_unref_page_prepare(&folio->page, pfn, 0))
 			continue;
-		}
 
 		/*
-		 * Free isolated pages directly to the allocator, see
+		 * Free isolated folios directly to the allocator, see
 		 * comment in free_unref_page.
 		 */
 		migratetype = get_pcppage_migratetype(&folio->page);
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&folio->lru);
 			free_one_page(folio_zone(folio), &folio->page, pfn,
 					0, migratetype, FPI_NONE);
 			continue;
 		}
+		if (j != i)
+			folios->folios[j] = folio;
+		j++;
 	}
+	folios->nr = j;
 
-	list_for_each_entry_safe(folio, next, list, lru) {
+	for (i = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
 		struct zone *zone = folio_zone(folio);
 
-		list_del(&folio->lru);
 		migratetype = get_pcppage_migratetype(&folio->page);
 
-		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * folios.
-		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		/* Different zone requires a different pcp lock */
+		if (zone != locked_zone) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
 			}
 
-			batch_count = 0;
-
 			/*
 			 * trylock is necessary as folios may be getting freed
 			 * from IRQ or SoftIRQ context after an IO completion.
@@ -2628,13 +2623,31 @@ void free_unref_page_list(struct list_head *list)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype, 0);
-		batch_count++;
 	}
 
 	if (pcp) {
 		pcp_spin_unlock(pcp);
 		pcp_trylock_finish(UP_flags);
 	}
+	folio_batch_reinit(folios);
+}
+
+void free_unref_page_list(struct list_head *list)
+{
+	struct folio_batch fbatch;
+
+	folio_batch_init(&fbatch);
+	while (!list_empty(list)) {
+		struct folio *folio = list_first_entry(list, struct folio, lru);
+
+		list_del(&folio->lru);
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		free_unref_folios(&fbatch);
+	}
+
+	if (fbatch.nr)
+		free_unref_folios(&fbatch);
 }
 
 /*
-- 
2.43.0




* [PATCH v3 04/18] mm: Use folios_put() in __folio_batch_release()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 03/18] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 05/18] memcg: Add mem_cgroup_uncharge_folios() Matthew Wilcox (Oracle)
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

There's no need to indirect through release_pages() and iterate
over this batch of folios an extra time; we can just use the batch
that we have.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/swap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 3d51f8c72017..1cfb7b897ebd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1083,8 +1083,7 @@ void __folio_batch_release(struct folio_batch *fbatch)
 		lru_add_drain();
 		fbatch->percpu_pvec_drained = true;
 	}
-	release_pages(fbatch->folios, folio_batch_count(fbatch));
-	folio_batch_reinit(fbatch);
+	folios_put(fbatch);
 }
 EXPORT_SYMBOL(__folio_batch_release);
 
-- 
2.43.0




* [PATCH v3 05/18] memcg: Add mem_cgroup_uncharge_folios()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 04/18] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 06/18] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

Almost identical to mem_cgroup_uncharge_list(), except it takes a
folio_batch instead of a list_head.
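
A condensed sketch of the intended pairing, as the vmscan.c conversions
later in the series (patches 11 and 12) use it, with locking and TLB
flush details omitted:

	if (folio_batch_add(&free_folios, folio) == 0) {
		mem_cgroup_uncharge_folios(&free_folios);
		free_unref_folios(&free_folios);
	}

i.e. once the batch fills up, uncharge the whole batch from the memcg
and then hand it to the page allocator.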

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/memcontrol.h | 14 ++++++++++++--
 mm/memcontrol.c            | 13 +++++++++++++
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4e4caeaea404..46d9abb20761 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -721,10 +721,16 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
 	__mem_cgroup_uncharge_list(page_list);
 }
 
-void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
+void __mem_cgroup_uncharge_folios(struct folio_batch *folios);
+static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
+{
+	if (mem_cgroup_disabled())
+		return;
+	__mem_cgroup_uncharge_folios(folios);
+}
 
+void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
 void mem_cgroup_replace_folio(struct folio *old, struct folio *new);
-
 void mem_cgroup_migrate(struct folio *old, struct folio *new);
 
 /**
@@ -1299,6 +1305,10 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
 {
 }
 
+static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
+{
+}
+
 static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
 		unsigned int nr_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 95c3fccb321b..4be37c9a0759 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -33,6 +33,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/vm_event_item.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -7564,6 +7565,18 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
 		uncharge_batch(&ug);
 }
 
+void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
+{
+	struct uncharge_gather ug;
+	unsigned int i;
+
+	uncharge_gather_clear(&ug);
+	for (i = 0; i < folios->nr; i++)
+		uncharge_folio(folios->folios[i], &ug);
+	if (ug.memcg)
+		uncharge_batch(&ug);
+}
+
 /**
  * mem_cgroup_replace_folio - Charge a folio's replacement.
  * @old: Currently circulating folio.
-- 
2.43.0




* [PATCH v3 06/18] mm: Remove use of folio list from folios_put()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 05/18] memcg: Add mem_cgroup_uncharge_folios() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 07/18] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

Instead of putting the interesting folios on a list, delete the
uninteresting ones from the folio_batch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/swap.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 1cfb7b897ebd..ee8b131bf32c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -961,12 +961,11 @@ void lru_cache_disable(void)
  */
 void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 {
-	int i;
-	LIST_HEAD(pages_to_free);
+	int i, j;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
-	for (i = 0; i < folios->nr; i++) {
+	for (i = 0, j = 0; i < folios->nr; i++) {
 		struct folio *folio = folios->folios[i];
 		unsigned int nr_refs = refs ? refs[i] : 1;
 
@@ -1016,14 +1015,20 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			count_vm_event(UNEVICTABLE_PGCLEARED);
 		}
 
-		list_add(&folio->lru, &pages_to_free);
+		if (j != i)
+			folios->folios[j] = folio;
+		j++;
 	}
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
+	if (!j) {
+		folio_batch_reinit(folios);
+		return;
+	}
 
-	mem_cgroup_uncharge_list(&pages_to_free);
-	free_unref_page_list(&pages_to_free);
-	folio_batch_reinit(folios);
+	folios->nr = j;
+	mem_cgroup_uncharge_folios(folios);
+	free_unref_folios(folios);
 }
 EXPORT_SYMBOL(folios_put_refs);
 
-- 
2.43.0




* [PATCH v3 07/18] mm: Use free_unref_folios() in put_pages_list()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 06/18] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 08/18] mm: use __page_cache_release() in folios_put() Matthew Wilcox (Oracle)
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

Break up the list of folios into batches here so that the folios are
more likely to be cache hot when doing the rest of the processing.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/swap.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index ee8b131bf32c..ad3f2e9448a4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -138,22 +138,25 @@ EXPORT_SYMBOL(__folio_put);
  */
 void put_pages_list(struct list_head *pages)
 {
-	struct folio *folio, *next;
+	struct folio_batch fbatch;
+	struct folio *folio;
 
-	list_for_each_entry_safe(folio, next, pages, lru) {
-		if (!folio_put_testzero(folio)) {
-			list_del(&folio->lru);
+	folio_batch_init(&fbatch);
+	list_for_each_entry(folio, pages, lru) {
+		if (!folio_put_testzero(folio))
 			continue;
-		}
 		if (folio_test_large(folio)) {
-			list_del(&folio->lru);
 			__folio_put_large(folio);
 			continue;
 		}
 		/* LRU flag must be clear because it's passed using the lru */
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		free_unref_folios(&fbatch);
 	}
 
-	free_unref_page_list(pages);
+	if (fbatch.nr)
+		free_unref_folios(&fbatch);
 	INIT_LIST_HEAD(pages);
 }
 EXPORT_SYMBOL(put_pages_list);
-- 
2.43.0




* [PATCH v3 08/18] mm: use __page_cache_release() in folios_put()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 07/18] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 09/18] mm: Handle large folios in free_unref_folios() Matthew Wilcox (Oracle)
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

Pass a pointer to the lruvec so we can take advantage of
folio_lruvec_relock_irqsave().  Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.
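
The calling-convention change at a glance (condensed from the hunks
below): rather than returning the lruvec, the helper now updates the
caller's pointer in place, so

	lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);

becomes

	folio_lruvec_relock_irqsave(folio, &lruvec, &flags);

which lets __page_cache_release() share its caller's lruvec/flags state.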

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/memcontrol.h | 16 +++++-----
 mm/swap.c                  | 62 ++++++++++++++++++--------------------
 2 files changed, 37 insertions(+), 41 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 46d9abb20761..8a0e8972a3d3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1705,18 +1705,18 @@ static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio,
 	return folio_lruvec_lock_irq(folio);
 }
 
-/* Don't lock again iff page's lruvec locked */
-static inline struct lruvec *folio_lruvec_relock_irqsave(struct folio *folio,
-		struct lruvec *locked_lruvec, unsigned long *flags)
+/* Don't lock again iff folio's lruvec locked */
+static inline void folio_lruvec_relock_irqsave(struct folio *folio,
+		struct lruvec **lruvecp, unsigned long *flags)
 {
-	if (locked_lruvec) {
-		if (folio_matches_lruvec(folio, locked_lruvec))
-			return locked_lruvec;
+	if (*lruvecp) {
+		if (folio_matches_lruvec(folio, *lruvecp))
+			return;
 
-		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+		unlock_page_lruvec_irqrestore(*lruvecp, *flags);
 	}
 
-	return folio_lruvec_lock_irqsave(folio, flags);
+	*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/swap.c b/mm/swap.c
index ad3f2e9448a4..dce5ea67ae05 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -74,22 +74,21 @@ static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
-/*
- * This path almost never happens for VM activity - pages are normally freed
- * in batches.  But it gets used by networking - and for compound pages.
- */
-static void __page_cache_release(struct folio *folio)
+static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
+		unsigned long *flagsp)
 {
 	if (folio_test_lru(folio)) {
-		struct lruvec *lruvec;
-		unsigned long flags;
-
-		lruvec = folio_lruvec_lock_irqsave(folio, &flags);
-		lruvec_del_folio(lruvec, folio);
+		folio_lruvec_relock_irqsave(folio, lruvecp, flagsp);
+		lruvec_del_folio(*lruvecp, folio);
 		__folio_clear_lru_flags(folio);
-		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
-	/* See comment on folio_test_mlocked in folios_put() */
+
+	/*
+	 * In rare cases, when truncation or holepunching raced with
+	 * munlock after VM_LOCKED was cleared, Mlocked may still be
+	 * found set here.  This does not indicate a problem, unless
+	 * "unevictable_pgs_cleared" appears worryingly large.
+	 */
 	if (unlikely(folio_test_mlocked(folio))) {
 		long nr_pages = folio_nr_pages(folio);
 
@@ -99,9 +98,23 @@ static void __page_cache_release(struct folio *folio)
 	}
 }
 
+/*
+ * This path almost never happens for VM activity - pages are normally freed
+ * in batches.  But it gets used by networking - and for compound pages.
+ */
+static void page_cache_release(struct folio *folio)
+{
+	struct lruvec *lruvec = NULL;
+	unsigned long flags;
+
+	__page_cache_release(folio, &lruvec, &flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
+}
+
 static void __folio_put_small(struct folio *folio)
 {
-	__page_cache_release(folio);
+	page_cache_release(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, 0);
 }
@@ -115,7 +128,7 @@ static void __folio_put_large(struct folio *folio)
 	 * be called for hugetlb (it has a separate hugetlb_cgroup.)
 	 */
 	if (!folio_test_hugetlb(folio))
-		__page_cache_release(folio);
+		page_cache_release(folio);
 	destroy_large_folio(folio);
 }
 
@@ -216,7 +229,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 		if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
 			continue;
 
-		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
+		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
 		move_fn(lruvec, folio);
 
 		folio_set_lru(folio);
@@ -999,24 +1012,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			continue;
 		}
 
-		if (folio_test_lru(folio)) {
-			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
-									&flags);
-			lruvec_del_folio(lruvec, folio);
-			__folio_clear_lru_flags(folio);
-		}
-
-		/*
-		 * In rare cases, when truncation or holepunching raced with
-		 * munlock after VM_LOCKED was cleared, Mlocked may still be
-		 * found set here.  This does not indicate a problem, unless
-		 * "unevictable_pgs_cleared" appears worryingly large.
-		 */
-		if (unlikely(folio_test_mlocked(folio))) {
-			__folio_clear_mlocked(folio);
-			zone_stat_sub_folio(folio, NR_MLOCK);
-			count_vm_event(UNEVICTABLE_PGCLEARED);
-		}
+		__page_cache_release(folio, &lruvec, &flags);
 
 		if (j != i)
 			folios->folios[j] = folio;
-- 
2.43.0




* [PATCH v3 09/18] mm: Handle large folios in free_unref_folios()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 08/18] mm: use __page_cache_release() in folios_put() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

Call folio_undo_large_rmappable() if needed.  free_unref_page_prepare()
destroys the ability to call folio_order(), so stash the order in
folio->private for the benefit of the second loop.
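
The stashing idiom, condensed from the hunks below ("..." stands for
the rest of each loop body):

	/* first loop: folio metadata is still intact, folio_order() works */
	unsigned int order = folio_order(folio);
	...
	folio->private = (void *)(unsigned long)order;

	/* second loop: prepare has destroyed that metadata, read the stash */
	unsigned int order = (unsigned long)folio->private;
	folio->private = NULL;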

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page_alloc.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff8759a69221..aa7026d81d07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2552,7 +2552,7 @@ void free_unref_page(struct page *page, unsigned int order)
 }
 
 /*
- * Free a batch of 0-order pages
+ * Free a batch of folios
  */
 void free_unref_folios(struct folio_batch *folios)
 {
@@ -2565,19 +2565,25 @@ void free_unref_folios(struct folio_batch *folios)
 	for (i = 0, j = 0; i < folios->nr; i++) {
 		struct folio *folio = folios->folios[i];
 		unsigned long pfn = folio_pfn(folio);
-		if (!free_unref_page_prepare(&folio->page, pfn, 0))
+		unsigned int order = folio_order(folio);
+
+		if (order > 0 && folio_test_large_rmappable(folio))
+			folio_undo_large_rmappable(folio);
+		if (!free_unref_page_prepare(&folio->page, pfn, order))
 			continue;
 
 		/*
-		 * Free isolated folios directly to the allocator, see
-		 * comment in free_unref_page.
+		 * Free isolated folios and orders not handled on the PCP
+		 * directly to the allocator, see comment in free_unref_page.
 		 */
 		migratetype = get_pcppage_migratetype(&folio->page);
-		if (unlikely(is_migrate_isolate(migratetype))) {
+		if (!pcp_allowed_order(order) ||
+		    is_migrate_isolate(migratetype)) {
 			free_one_page(folio_zone(folio), &folio->page, pfn,
-					0, migratetype, FPI_NONE);
+					order, migratetype, FPI_NONE);
 			continue;
 		}
+		folio->private = (void *)(unsigned long)order;
 		if (j != i)
 			folios->folios[j] = folio;
 		j++;
@@ -2587,7 +2593,9 @@ void free_unref_folios(struct folio_batch *folios)
 	for (i = 0; i < folios->nr; i++) {
 		struct folio *folio = folios->folios[i];
 		struct zone *zone = folio_zone(folio);
+		unsigned int order = (unsigned long)folio->private;
 
+		folio->private = NULL;
 		migratetype = get_pcppage_migratetype(&folio->page);
 
 		/* Different zone requires a different pcp lock */
@@ -2606,7 +2614,7 @@ void free_unref_folios(struct folio_batch *folios)
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
 				free_one_page(zone, &folio->page,
-						folio_pfn(folio), 0,
+						folio_pfn(folio), order,
 						migratetype, FPI_NONE);
 				locked_zone = NULL;
 				continue;
@@ -2622,7 +2630,8 @@ void free_unref_folios(struct folio_batch *folios)
 			migratetype = MIGRATE_MOVABLE;
 
 		trace_mm_page_free_batched(&folio->page);
-		free_unref_page_commit(zone, pcp, &folio->page, migratetype, 0);
+		free_unref_page_commit(zone, pcp, &folio->page, migratetype,
+				order);
 	}
 
 	if (pcp) {
-- 
2.43.0




* [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (8 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 09/18] mm: Handle large folios in free_unref_folios() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-03-06 13:42   ` Ryan Roberts
  2024-02-27 17:42 ` [PATCH v3 11/18] mm: Free folios in a batch in shrink_folio_list() Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

Hugetlb folios still get special treatment, but normal large folios
can now be freed by free_unref_folios().  This should have a reasonable
performance impact, although the exact numbers are TBD.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/swap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index dce5ea67ae05..6b697d33fa5b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1003,12 +1003,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 		if (!folio_ref_sub_and_test(folio, nr_refs))
 			continue;
 
-		if (folio_test_large(folio)) {
+		/* hugetlb has its own memcg */
+		if (folio_test_hugetlb(folio)) {
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			__folio_put_large(folio);
+			free_huge_folio(folio);
 			continue;
 		}
 
-- 
2.43.0




* [PATCH v3 11/18] mm: Free folios in a batch in shrink_folio_list()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (9 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 12/18] mm: Free folios directly in move_folios_to_lru() Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Mel Gorman

Use free_unref_folios() to free the folios.  This may increase the
number of IPIs from calling try_to_unmap_flush() more often, but that's
going to be very workload-dependent.  It may even reduce the number of
IPIs as we now batch-free large folios instead of freeing them one at
a time.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d3c6e84475b9..0c88cb23cc40 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1026,14 +1026,15 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		struct pglist_data *pgdat, struct scan_control *sc,
 		struct reclaim_stat *stat, bool ignore_references)
 {
+	struct folio_batch free_folios;
 	LIST_HEAD(ret_folios);
-	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
 
+	folio_batch_init(&free_folios);
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 	do_demote_pass = can_demote(pgdat->node_id, sc);
@@ -1432,14 +1433,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 */
 		nr_reclaimed += nr_pages;
 
-		/*
-		 * Is there need to periodically free_folio_list? It would
-		 * appear not as the counts should be low
-		 */
-		if (unlikely(folio_test_large(folio)))
-			destroy_large_folio(folio);
-		else
-			list_add(&folio->lru, &free_folios);
+		if (folio_batch_add(&free_folios, folio) == 0) {
+			mem_cgroup_uncharge_folios(&free_folios);
+			try_to_unmap_flush();
+			free_unref_folios(&free_folios);
+		}
 		continue;
 
 activate_locked_split:
@@ -1503,9 +1501,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
 
-	mem_cgroup_uncharge_list(&free_folios);
+	mem_cgroup_uncharge_folios(&free_folios);
 	try_to_unmap_flush();
-	free_unref_page_list(&free_folios);
+	free_unref_folios(&free_folios);
 
 	list_splice(&ret_folios, folio_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-- 
2.43.0




* [PATCH v3 12/18] mm: Free folios directly in move_folios_to_lru()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (10 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 11/18] mm: Free folios in a batch in shrink_folio_list() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 13/18] memcg: Remove mem_cgroup_uncharge_list() Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

The few folios which can't be moved to the LRU list (because their
refcount dropped to zero) used to be returned to the caller to dispose
of.  Make this simpler for callers by freeing those folios directly
through free_unref_folios().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/vmscan.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0c88cb23cc40..c86c4694bcb1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1801,7 +1801,6 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
- * On return, @list is reused as a list of folios to be freed by the caller.
  *
  * Returns the number of pages moved to the given lruvec.
  */
@@ -1809,8 +1808,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		struct list_head *list)
 {
 	int nr_pages, nr_moved = 0;
-	LIST_HEAD(folios_to_free);
+	struct folio_batch free_folios;
 
+	folio_batch_init(&free_folios);
 	while (!list_empty(list)) {
 		struct folio *folio = lru_to_folio(list);
 
@@ -1839,12 +1839,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);
 
-			if (unlikely(folio_test_large(folio))) {
+			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
-				destroy_large_folio(folio);
+				mem_cgroup_uncharge_folios(&free_folios);
+				free_unref_folios(&free_folios);
 				spin_lock_irq(&lruvec->lru_lock);
-			} else
-				list_add(&folio->lru, &folios_to_free);
+			}
 
 			continue;
 		}
@@ -1861,10 +1861,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
-	/*
-	 * To save our caller's stack, now use input list for pages to free.
-	 */
-	list_splice(&folios_to_free, list);
+	if (free_folios.nr) {
+		spin_unlock_irq(&lruvec->lru_lock);
+		mem_cgroup_uncharge_folios(&free_folios);
+		free_unref_folios(&free_folios);
+		spin_lock_irq(&lruvec->lru_lock);
+	}
 
 	return nr_moved;
 }
@@ -1943,8 +1945,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
-	mem_cgroup_uncharge_list(&folio_list);
-	free_unref_page_list(&folio_list);
 
 	/*
 	 * If dirty folios are scanned that are not queued for IO, it
@@ -2085,8 +2085,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	nr_activate = move_folios_to_lru(lruvec, &l_active);
 	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
-	/* Keep all free folios in l_active list */
-	list_splice(&l_inactive, &l_active);
 
 	__count_vm_events(PGDEACTIVATE, nr_deactivate);
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
@@ -2096,8 +2094,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	if (nr_rotated)
 		lru_note_cost(lruvec, file, 0, nr_rotated);
-	mem_cgroup_uncharge_list(&l_active);
-	free_unref_page_list(&l_active);
 	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
 			nr_deactivate, nr_rotated, sc->priority, file);
 }
@@ -4601,10 +4597,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	spin_unlock_irq(&lruvec->lru_lock);
 
-	mem_cgroup_uncharge_list(&list);
-	free_unref_page_list(&list);
-
-	INIT_LIST_HEAD(&list);
 	list_splice_init(&clean, &list);
 
 	if (!list_empty(&list)) {
-- 
2.43.0




* [PATCH v3 13/18] memcg: Remove mem_cgroup_uncharge_list()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (11 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 12/18] mm: Free folios directly in move_folios_to_lru() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 14/18] mm: Remove free_unref_page_list() Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

All users have been converted to mem_cgroup_uncharge_folios() so
we can remove this API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/memcontrol.h | 12 ------------
 mm/memcontrol.c            | 19 -------------------
 2 files changed, 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8a0e8972a3d3..6ed0c54a3773 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -713,14 +713,6 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
 	__mem_cgroup_uncharge(folio);
 }
 
-void __mem_cgroup_uncharge_list(struct list_head *page_list);
-static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
-{
-	if (mem_cgroup_disabled())
-		return;
-	__mem_cgroup_uncharge_list(page_list);
-}
-
 void __mem_cgroup_uncharge_folios(struct folio_batch *folios);
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
@@ -1301,10 +1293,6 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
 {
 }
 
-static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
-{
-}
-
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4be37c9a0759..22db1760e9bb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7546,25 +7546,6 @@ void __mem_cgroup_uncharge(struct folio *folio)
 	uncharge_batch(&ug);
 }
 
-/**
- * __mem_cgroup_uncharge_list - uncharge a list of page
- * @page_list: list of pages to uncharge
- *
- * Uncharge a list of pages previously charged with
- * __mem_cgroup_charge().
- */
-void __mem_cgroup_uncharge_list(struct list_head *page_list)
-{
-	struct uncharge_gather ug;
-	struct folio *folio;
-
-	uncharge_gather_clear(&ug);
-	list_for_each_entry(folio, page_list, lru)
-		uncharge_folio(folio, &ug);
-	if (ug.memcg)
-		uncharge_batch(&ug);
-}
-
 void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
 	struct uncharge_gather ug;
-- 
2.43.0




* [PATCH v3 14/18] mm: Remove free_unref_page_list()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (12 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 13/18] memcg: Remove mem_cgroup_uncharge_list() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 15/18] mm: Remove lru_to_page() Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts

All callers now use free_unref_folios() so we can delete this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/internal.h   |  1 -
 mm/page_alloc.c | 18 ------------------
 2 files changed, 19 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3ca7e9d45b33..cc91830f6eae 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -454,7 +454,6 @@ extern int user_min_free_kbytes;
 
 void free_unref_page(struct page *page, unsigned int order);
 void free_unref_folios(struct folio_batch *fbatch);
-void free_unref_page_list(struct list_head *list);
 
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa7026d81d07..01b60769726e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2641,24 +2641,6 @@ void free_unref_folios(struct folio_batch *folios)
 	folio_batch_reinit(folios);
 }
 
-void free_unref_page_list(struct list_head *list)
-{
-	struct folio_batch fbatch;
-
-	folio_batch_init(&fbatch);
-	while (!list_empty(list)) {
-		struct folio *folio = list_first_entry(list, struct folio, lru);
-
-		list_del(&folio->lru);
-		if (folio_batch_add(&fbatch, folio) > 0)
-			continue;
-		free_unref_folios(&fbatch);
-	}
-
-	if (fbatch.nr)
-		free_unref_folios(&fbatch);
-}
-
 /*
  * split_page takes a non-compound higher-order page, and splits it into
  * n (1<<order) sub-pages: page[0..n]
-- 
2.43.0




* [PATCH v3 15/18] mm: Remove lru_to_page()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (13 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 14/18] mm: Remove free_unref_page_list() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 16/18] mm: Convert free_pages_and_swap_cache() to use folios_put() Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

The last user was removed over a year ago; remove the definition.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 07d950e63c30..c4a76520c967 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,7 +227,6 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
-#define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 static inline struct folio *lru_to_folio(struct list_head *head)
 {
 	return list_entry((head)->prev, struct folio, lru);
-- 
2.43.0




* [PATCH v3 16/18] mm: Convert free_pages_and_swap_cache() to use folios_put()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (14 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 15/18] mm: Remove lru_to_page() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 17/18] mm: Use a folio in __collapse_huge_page_copy_succeeded() Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 18/18] mm: Convert free_swap_cache() to take a folio Matthew Wilcox (Oracle)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

Process the pages in batch-sized quantities instead of all-at-once.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/swap_state.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f540748f7c0..2a73d3bc5d48 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -15,6 +15,7 @@
 #include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/backing-dev.h>
 #include <linux/blkdev.h>
 #include <linux/migrate.h>
@@ -310,21 +311,25 @@ void free_page_and_swap_cache(struct page *page)
  */
 void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 {
+	struct folio_batch folios;
+	unsigned int refs[PAGEVEC_SIZE];
+
 	lru_add_drain();
+	folio_batch_init(&folios);
 	for (int i = 0; i < nr; i++) {
-		struct page *page = encoded_page_ptr(pages[i]);
+		struct folio *folio = page_folio(encoded_page_ptr(pages[i]));
 
-		/*
-		 * Skip over the "nr_pages" entry. It's sufficient to call
-		 * free_swap_cache() only once per folio.
-		 */
+		free_swap_cache(&folio->page);
+		refs[folios.nr] = 1;
 		if (unlikely(encoded_page_flags(pages[i]) &
 			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
-			i++;
-
-		free_swap_cache(page);
+			refs[folios.nr] = encoded_nr_pages(pages[++i]);
+
+		if (folio_batch_add(&folios, folio) == 0)
+			folios_put_refs(&folios, refs);
 	}
-	release_pages(pages, nr);
+	if (folios.nr)
+		folios_put_refs(&folios, refs);
 }
 
 static inline bool swap_use_vma_readahead(void)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v3 17/18] mm: Use a folio in __collapse_huge_page_copy_succeeded()
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (15 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 16/18] mm: Convert free_pages_and_swap_cache() to use folios_put() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  2024-02-27 17:42 ` [PATCH v3 18/18] mm: Convert free_swap_cache() to take a folio Matthew Wilcox (Oracle)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), linux-mm

These pages are all chained together through the lru list, so we know
they're folios.  Use the folio APIs to save three hidden calls to
compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/khugepaged.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2771fc043b3b..5cc39c3f3847 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -689,9 +689,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						spinlock_t *ptl,
 						struct list_head *compound_pagelist)
 {
-	struct folio *src_folio;
-	struct page *src_page;
-	struct page *tmp;
+	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
@@ -710,10 +708,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
 			}
 		} else {
-			src_page = pte_page(pteval);
-			src_folio = page_folio(src_page);
-			if (!folio_test_large(src_folio))
-				release_pte_folio(src_folio);
+			struct page *src_page = pte_page(pteval);
+
+			src = page_folio(src_page);
+			if (!folio_test_large(src))
+				release_pte_folio(src);
 			/*
 			 * ptl mostly unnecessary, but preempt has to
 			 * be disabled to update the per-cpu stats
@@ -721,20 +720,19 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 			 */
 			spin_lock(ptl);
 			ptep_clear(vma->vm_mm, address, _pte);
-			folio_remove_rmap_pte(src_folio, src_page, vma);
+			folio_remove_rmap_pte(src, src_page, vma);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
 	}
 
-	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
-		list_del(&src_page->lru);
-		mod_node_page_state(page_pgdat(src_page),
-				    NR_ISOLATED_ANON + page_is_file_lru(src_page),
-				    -compound_nr(src_page));
-		unlock_page(src_page);
-		free_swap_cache(src_page);
-		putback_lru_page(src_page);
+	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
+		list_del(&src->lru);
+		node_stat_sub_folio(src, NR_ISOLATED_ANON +
+				folio_is_file_lru(src));
+		folio_unlock(src);
+		free_swap_cache(&src->page);
+		folio_putback_lru(src);
 	}
 }
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v3 18/18] mm: Convert free_swap_cache() to take a folio
  2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
                   ` (16 preceding siblings ...)
  2024-02-27 17:42 ` [PATCH v3 17/18] mm: Use a folio in __collapse_huge_page_copy_succeeded() Matthew Wilcox (Oracle)
@ 2024-02-27 17:42 ` Matthew Wilcox (Oracle)
  17 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-27 17:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, Ryan Roberts, David Hildenbrand

All but one caller of free_swap_cache() already has a folio, so convert it
to take a folio; the one remaining caller, free_page_and_swap_cache(), now
does the page_folio() conversion itself.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h |  8 ++++----
 mm/khugepaged.c      |  2 +-
 mm/memory.c          |  2 +-
 mm/swap_state.c      | 12 ++++++------
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3e2b038852bb..a211a0383425 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -440,9 +440,9 @@ static inline unsigned long total_swapcache_pages(void)
 	return global_node_page_state(NR_SWAPCACHE);
 }
 
-extern void free_swap_cache(struct page *page);
-extern void free_page_and_swap_cache(struct page *);
-extern void free_pages_and_swap_cache(struct encoded_page **, int);
+void free_swap_cache(struct folio *folio);
+void free_page_and_swap_cache(struct page *);
+void free_pages_and_swap_cache(struct encoded_page **, int);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -524,7 +524,7 @@ static inline void put_swap_device(struct swap_info_struct *si)
 /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
 #define free_swap_and_cache(e) is_pfn_swap_entry(e)
 
-static inline void free_swap_cache(struct page *page)
+static inline void free_swap_cache(struct folio *folio)
 {
 }
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cc39c3f3847..d19fba3355a7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -731,7 +731,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 		node_stat_sub_folio(src, NR_ISOLATED_ANON +
 				folio_is_file_lru(src));
 		folio_unlock(src);
-		free_swap_cache(&src->page);
+		free_swap_cache(src);
 		folio_putback_lru(src);
 	}
 }
diff --git a/mm/memory.c b/mm/memory.c
index a4b9460d6ca5..f2bddabae199 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3452,7 +3452,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		folio_put(new_folio);
 	if (old_folio) {
 		if (page_copied)
-			free_swap_cache(&old_folio->page);
+			free_swap_cache(old_folio);
 		folio_put(old_folio);
 	}
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2a73d3bc5d48..b194dcf49f01 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -283,10 +283,8 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin,
  * folio_free_swap() _with_ the lock.
  * 					- Marcelo
  */
-void free_swap_cache(struct page *page)
+void free_swap_cache(struct folio *folio)
 {
-	struct folio *folio = page_folio(page);
-
 	if (folio_test_swapcache(folio) && !folio_mapped(folio) &&
 	    folio_trylock(folio)) {
 		folio_free_swap(folio);
@@ -300,9 +298,11 @@ void free_swap_cache(struct page *page)
  */
 void free_page_and_swap_cache(struct page *page)
 {
-	free_swap_cache(page);
+	struct folio *folio = page_folio(page);
+
+	free_swap_cache(folio);
 	if (!is_huge_zero_page(page))
-		put_page(page);
+		folio_put(folio);
 }
 
 /*
@@ -319,7 +319,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	for (int i = 0; i < nr; i++) {
 		struct folio *folio = page_folio(encoded_page_ptr(pages[i]));
 
-		free_swap_cache(&folio->page);
+		free_swap_cache(folio);
 		refs[folios.nr] = 1;
 		if (unlikely(encoded_page_flags(pages[i]) &
 			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
@ 2024-03-06 13:42   ` Ryan Roberts
  2024-03-06 16:09     ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-06 13:42 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton; +Cc: linux-mm

Hi Matthew,

Afraid I have another bug for you...

On 27/02/2024 17:42, Matthew Wilcox (Oracle) wrote:
> Hugetlb folios still get special treatment, but normal large folios
> can now be freed by free_unref_folios().  This should have a reasonable
> performance impact, TBD.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

When running some swap tests with this change (which is in mm-stable) present, I see BadThings(TM). Usually I see a "bad page state" followed by a delay of a few seconds, followed by an oops or NULL pointer deref. Bisect points to this change, and if I revert it, the problem goes away.

Here is one example, running against mm-unstable (a7f399ae964e):

[   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
[   76.240196] kernel BUG at include/linux/mm.h:1120!
[   76.240198] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[   76.240724]  dump_backtrace+0x98/0xf8
[   76.241523] Modules linked in:
[   76.241943]  show_stack+0x20/0x38
[   76.242282] 
[   76.242680]  dump_stack_lvl+0x48/0x60
[   76.242855] CPU: 2 PID: 62 Comm: kcompactd0 Not tainted 6.8.0-rc5-00456-ga7f399ae964e #16
[   76.243278]  dump_stack+0x18/0x28
[   76.244138] Hardware name: linux,dummy-virt (DT)
[   76.244510]  bad_page+0x88/0x128
[   76.244995] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   76.245370]  free_page_is_bad_report+0xa4/0xb8
[   76.246101] pc : migrate_folio_done+0x140/0x150
[   76.246572]  __free_pages_ok+0x370/0x4b0
[   76.247048] lr : migrate_folio_done+0x140/0x150
[   76.247489]  destroy_large_folio+0x94/0x108
[   76.247971] sp : ffff800083f5b8d0
[   76.248451]  __folio_put_large+0x70/0xc0
[   76.248807] x29: ffff800083f5b8d0
[   76.249256]  __folio_put+0xac/0xc0
[   76.249260]  deferred_split_scan+0x234/0x340
[   76.249607]  x28: 0000000000000000
[   76.249997]  do_shrink_slab+0x144/0x460
[   76.250444]  x27: ffff800083f5bb30
[   76.250829]  shrink_slab+0x2e0/0x4e0
[   76.251234] 
[   76.251604]  shrink_node+0x204/0x8a0
[   76.251979] x26: 0000000000000001
[   76.252147]  do_try_to_free_pages+0xd0/0x568
[   76.252527]  x25: 0000000000000010
[   76.252881]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[   76.253337]  x24: fffffc0008552800
[   76.253687]  try_charge_memcg+0x12c/0x650
[   76.254219] 
[   76.254583]  __mem_cgroup_charge+0x6c/0xd0
[   76.255013] x23: ffff0000e6f353a8
[   76.255181]  __handle_mm_fault+0xe90/0x16a8
[   76.255624]  x22: ffff0013f5fa59c0
[   76.255977]  handle_mm_fault+0x70/0x2b0
[   76.256413]  x21: 0000000000000000
[   76.256756]  do_page_fault+0x100/0x4c0
[   76.257177] 
[   76.257540]  do_translation_fault+0xb4/0xd0
[   76.257932] x20: 0000000000000007
[   76.258095]  do_mem_abort+0x4c/0xa8
[   76.258532]  x19: fffffc0008552800
[   76.258883]  el0_da+0x2c/0x78
[   76.259263]  x18: 0000000000000010
[   76.259616]  el0t_64_sync_handler+0xe4/0x158
[   76.259933] 
[   76.260286]  el0t_64_sync+0x190/0x198
[   76.260729] x17: 3030303030303020 x16: 6666666666666666 x15: 3030303030303030
[   76.262010] x14: 0000000000000000 x13: 7465732029732867 x12: 616c662045455246
[   76.262746] x11: 5f54415f4b434548 x10: ffff800082e8bff8 x9 : ffff8000801276ac
[   76.263462] x8 : 00000000ffffefff x7 : ffff800082e8bff8 x6 : 0000000000000000
[   76.264182] x5 : ffff0013f5eb9d08 x4 : 0000000000000000 x3 : 0000000000000000
[   76.264903] x2 : 0000000000000000 x1 : ffff0000c105d640 x0 : 000000000000003e
[   76.265604] Call trace:
[   76.265865]  migrate_folio_done+0x140/0x150
[   76.266278]  migrate_pages_batch+0x9ec/0xff0
[   76.266716]  migrate_pages+0xd20/0xe20
[   76.267103]  compact_zone+0x7b4/0x1000
[   76.267460]  kcompactd_do_work+0x174/0x4d8
[   76.267869]  kcompactd+0x26c/0x418
[   76.268175]  kthread+0x120/0x130
[   76.268517]  ret_from_fork+0x10/0x20
[   76.268892] Code: aa1303e0 b000d161 9100c021 97fe0465 (d4210000) 
[   76.269447] ---[ end trace 0000000000000000 ]---
[   76.269893] note: kcompactd0[62] exited with irqs disabled
[   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0xffffbd0a0 pfn:0x2554a0
[   76.270483] note: kcompactd0[62] exited with preempt_count 1
[   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
[   76.272521] flags: 0xbfffc0000080058(uptodate|dirty|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
[   76.273265] page_type: 0xffffffff()
[   76.273542] raw: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000
[   76.274368] raw: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000
[   76.275043] head: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000
[   76.275651] head: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000
[   76.276407] head: 0bfffc0000000000 0000000000000000 fffffc0008552848 0000000000000000
[   76.277064] head: 0000001000000000 0000000000000000 00000000ffffffff 0000000000000000
[   76.277784] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[   76.278502] ------------[ cut here ]------------
[   76.278893] kernel BUG at include/linux/mm.h:1120!
[   76.279269] Internal error: Oops - BUG: 00000000f2000800 [#2] PREEMPT SMP
[   76.280144] Modules linked in:
[   76.280401] CPU: 6 PID: 1337 Comm: usemem Tainted: G    B D            6.8.0-rc5-00456-ga7f399ae964e #16
[   76.281214] Hardware name: linux,dummy-virt (DT)
[   76.281635] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   76.282256] pc : deferred_split_scan+0x2f0/0x340
[   76.282698] lr : deferred_split_scan+0x2f0/0x340
[   76.283082] sp : ffff80008681b830
[   76.283426] x29: ffff80008681b830 x28: ffff0000cd4fb3c0 x27: fffffc0008552800
[   76.284113] x26: 0000000000000001 x25: 00000000ffffffff x24: 0000000000000001
[   76.284914] x23: 0000000000000000 x22: fffffc0008552800 x21: ffff0000e9df7820
[   76.285590] x20: ffff80008681b898 x19: ffff0000e9df7818 x18: 0000000000000000
[   76.286271] x17: 0000000000000001 x16: 0000000000000001 x15: ffff0000c0617210
[   76.286927] x14: ffff0000c10b6558 x13: 0000000000000040 x12: 0000000000000228
[   76.287543] x11: 0000000000000040 x10: 0000000000000a90 x9 : ffff800080220ed8
[   76.288176] x8 : ffff0000cd4fbeb0 x7 : 0000000000000000 x6 : 0000000000000000
[   76.288842] x5 : ffff0013f5f35d08 x4 : 0000000000000000 x3 : 0000000000000000
[   76.289538] x2 : 0000000000000000 x1 : ffff0000cd4fb3c0 x0 : 000000000000003e
[   76.290201] Call trace:
[   76.290432]  deferred_split_scan+0x2f0/0x340
[   76.290856]  do_shrink_slab+0x144/0x460
[   76.291221]  shrink_slab+0x2e0/0x4e0
[   76.291513]  shrink_node+0x204/0x8a0
[   76.291831]  do_try_to_free_pages+0xd0/0x568
[   76.292192]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[   76.292599]  try_charge_memcg+0x12c/0x650
[   76.292926]  __mem_cgroup_charge+0x6c/0xd0
[   76.293289]  __handle_mm_fault+0xe90/0x16a8
[   76.293713]  handle_mm_fault+0x70/0x2b0
[   76.294031]  do_page_fault+0x100/0x4c0
[   76.294343]  do_translation_fault+0xb4/0xd0
[   76.294694]  do_mem_abort+0x4c/0xa8
[   76.294968]  el0_da+0x2c/0x78
[   76.295202]  el0t_64_sync_handler+0xe4/0x158
[   76.295565]  el0t_64_sync+0x190/0x198
[   76.295860] Code: aa1603e0 d000d0e1 9100c021 97fdc715 (d4210000) 
[   76.296429] ---[ end trace 0000000000000000 ]---
[   76.296805] note: usemem[1337] exited with irqs disabled
[   76.297261] note: usemem[1337] exited with preempt_count 1



My test case is intended to stress swap:

  - Running in VM (on Ampere Altra) with 70 vCPUs and 80G RAM
  - Have a 35G block ram device (CONFIG_BLK_DEV_RAM & "brd.rd_nr=1 brd.rd_size=36700160")
  - the ramdisk is configured as the swap backend
  - run the test case in a memcg constrained to 40G (to force mem pressure)
  - test case has 70 processes, each allocating and writing 1G of RAM


swapoff -a
mkswap /dev/ram0
swapon -f /dev/ram0
cgcreate -g memory:/mmperfcgroup
echo 40G > /sys/fs/cgroup/mmperfcgroup/memory.max
cgexec -g memory:mmperfcgroup sudo -u $(whoami) bash

Then inside that second bash shell, run this script:

--8<---
function run_usemem_once {
        ./usemem -n 70 -O 1G | grep -v "free memory"
}

function run_usemem_multi {
        size=${1}
        for i in {1..2}; do
                echo "${size} THP ${i}"
                run_usemem_once
        done
}

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
run_usemem_multi "64K"
--8<---

It will usually get through the first iteration of the loop in run_usemem_multi() and fail on the second. I've never seen it get all the way through both iterations.

"usemem" is from the vm-scalability suite. It just allocates and writes loads of anonymous memory (70 is concurrent processes, 1G is the amount of memory per process). Then the memory pressure from the cgroup causes lots of swap to happen.

> ---
>  mm/swap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index dce5ea67ae05..6b697d33fa5b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1003,12 +1003,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>  		if (!folio_ref_sub_and_test(folio, nr_refs))
>  			continue;
>  
> -		if (folio_test_large(folio)) {
> +		/* hugetlb has its own memcg */
> +		if (folio_test_hugetlb(folio)) {

This still looks reasonable to me after re-review, so I have no idea what the problem is? I recall seeing some weird crashes when I looked at this original RFC, but didn't have time to debug at the time. I wonder if the root cause is the same.

If you find a smoking gun, I'm happy to test it if the above is too painful to reproduce.

Thanks,
Ryan

>  			if (lruvec) {
>  				unlock_page_lruvec_irqrestore(lruvec, flags);
>  				lruvec = NULL;
>  			}
> -			__folio_put_large(folio);
> +			free_huge_folio(folio);
>  			continue;
>  		}
>  



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 13:42   ` Ryan Roberts
@ 2024-03-06 16:09     ` Matthew Wilcox
  2024-03-06 16:19       ` Ryan Roberts
  2024-03-10 11:01       ` Ryan Roberts
  0 siblings, 2 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-06 16:09 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
> When running some swap tests with this change (which is in mm-stable)
> present, I see BadThings(TM). Usually I see a "bad page state"
> followed by a delay of a few seconds, followed by an oops or NULL
> pointer deref. Bisect points to this change, and if I revert it,
> the problem goes away.

That oops is really messed up ;-(  We've clearly got two CPUs oopsing at
the same time and it's all interleaved.  That said, I can pick some
nuggets out of it.

> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
> [   76.240196] kernel BUG at include/linux/mm.h:1120!

These are the two different BUGs being called simultaneously ...

The first one is bad_page() in page_alloc.c and the second is
put_page_testzero()
        VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);

I'm sure it's significant that both of these are the same page (pfn
2554a0).  Feels like we have two CPUs calling folio_put() at the same
time, and one of them underflows.  It probably doesn't matter which call
trace ends up in bad_page() and which in put_page_testzero().

One of them is coming from deferred_split_scan(), which is weird because
we can see the folio_try_get() earlier in the function.  So whatever
this folio was, we found it on the deferred split list, got its refcount,
moved it to the local list, either failed to get the lock, or
successfully got the lock, split it, unlocked it and put it.

(I can see this was invoked from page fault -> memcg shrinking.  That's
probably irrelevant but explains some of the functions in the backtrace)

The other call trace comes from migrate_folio_done() where we're putting
the _source_ folio.  That was called from migrate_pages_batch() which
was called from kcompactd.

Um.  Where do we handle the deferred list in the migration code?


I've also tried looking at this from a different angle -- what is it
about this commit that produces this problem?  It's a fairly small
commit:

-               if (folio_test_large(folio)) {
+               /* hugetlb has its own memcg */
+               if (folio_test_hugetlb(folio)) {
                        if (lruvec) {
                                unlock_page_lruvec_irqrestore(lruvec, flags);
                                lruvec = NULL;
                        }
-                       __folio_put_large(folio);
+                       free_huge_folio(folio);

So all that's changed is that large non-hugetlb folios do not call
__folio_put_large().  As a reminder, that function does:

        if (!folio_test_hugetlb(folio))
                page_cache_release(folio);
        destroy_large_folio(folio);

and destroy_large_folio() does:
        if (folio_test_large_rmappable(folio))
                folio_undo_large_rmappable(folio);

        mem_cgroup_uncharge(folio);
        free_the_page(&folio->page, folio_order(folio));

So after my patch, instead of calling (in order):

	page_cache_release(folio);
	folio_undo_large_rmappable(folio);
	mem_cgroup_uncharge(folio);
	free_unref_page()

it calls:

	__page_cache_release(folio, &lruvec, &flags);
	mem_cgroup_uncharge_folios()
	folio_undo_large_rmappable(folio);

So have I simply widened the window for this race, whatever it is
exactly?  Something involving mis-handling of the deferred list?

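For reference (paraphrasing from memory, so treat this as a sketch rather
than the exact source), folio_undo_large_rmappable() is the step that takes
the folio off its deferred split queue:

	if (data_race(list_empty(&folio->_deferred_list)))
		return;

	ds_queue = get_deferred_split_queue(folio);
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);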


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 16:09     ` Matthew Wilcox
@ 2024-03-06 16:19       ` Ryan Roberts
  2024-03-06 17:41         ` Ryan Roberts
  2024-03-10 11:01       ` Ryan Roberts
  1 sibling, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-06 16:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 06/03/2024 16:09, Matthew Wilcox wrote:
> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
>> When running some swap tests with this change (which is in mm-stable)
>> present, I see BadThings(TM). Usually I see a "bad page state"
>> followed by a delay of a few seconds, followed by an oops or NULL
>> pointer deref. Bisect points to this change, and if I revert it,
>> the problem goes away.
> 
> That oops is really messed up ;-(  We've clearly got two CPUs oopsing at
> the same time and it's all interleaved.  That said, I can pick some
> nuggets out of it.
> 
>> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
>> [   76.240196] kernel BUG at include/linux/mm.h:1120!
> 
> These are the two different BUGs being called simultaneously ...
> 
> The first one is bad_page() in page_alloc.c and the second is
> put_page_testzero()
>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> 
> I'm sure it's significant that both of these are the same page (pfn
> 2554a0).  Feels like we have two CPUs calling folio_put() at the same
> time, and one of them underflows.  It probably doesn't matter which call
> trace ends up in bad_page() and which in put_page_testzero().
> 
> One of them is coming from deferred_split_scan(), which is weird because
> we can see the folio_try_get() earlier in the function.  So whatever
> this folio was, we found it on the deferred split list, got its refcount,
> moved it to the local list, either failed to get the lock, or
> successfully got the lock, split it, unlocked it and put it.
> 
> (I can see this was invoked from page fault -> memcg shrinking.  That's
> probably irrelevant but explains some of the functions in the backtrace)
> 
> The other call trace comes from migrate_folio_done() where we're putting
> the _source_ folio.  That was called from migrate_pages_batch() which
> was called from kcompactd.
> 
> Um.  Where do we handle the deferred list in the migration code?
> 
> 
> I've also tried looking at this from a different angle -- what is it
> about this commit that produces this problem?  It's a fairly small
> commit:
> 
> -               if (folio_test_large(folio)) {
> +               /* hugetlb has its own memcg */
> +               if (folio_test_hugetlb(folio)) {
>                         if (lruvec) {
>                                 unlock_page_lruvec_irqrestore(lruvec, flags);
>                                 lruvec = NULL;
>                         }
> -                       __folio_put_large(folio);
> +                       free_huge_folio(folio);
> 
> So all that's changed is that large non-hugetlb folios do not call
> __folio_put_large().  As a reminder, that function does:
> 
>         if (!folio_test_hugetlb(folio))
>                 page_cache_release(folio);
>         destroy_large_folio(folio);
> 
> and destroy_large_folio() does:
>         if (folio_test_large_rmappable(folio))
>                 folio_undo_large_rmappable(folio);
> 
>         mem_cgroup_uncharge(folio);
>         free_the_page(&folio->page, folio_order(folio));
> 
> So after my patch, instead of calling (in order):
> 
> 	page_cache_release(folio);
> 	folio_undo_large_rmappable(folio);
> 	mem_cgroup_uncharge(folio);
> 	free_unref_page()
> 
> it calls:
> 
> 	__page_cache_release(folio, &lruvec, &flags);
> 	mem_cgroup_uncharge_folios()
> 	folio_undo_large_rmappable(folio);
> 
> So have I simply widened the window for this race 

Yes that's the conclusion I'm coming to. I have reverted this patch and am still
seeing what looks like the same problem very occasionally. (I was just about to
let you know when I saw this reply). It's much harder to reproduce now... great.

The original oops I reported against your RFC is here:
https://lore.kernel.org/linux-mm/eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com/

Looks like I had UBSAN enabled for that run. Let me turn on all the bells and
whistles and see if I can get it to repro more reliably to bisect.

Assuming the original oops and this are related, that implies that the problem
is lurking somewhere in this series, if not this patch.

I'll come back to you shortly...

>, whatever it is
> exactly?  Something involving mis-handling of the deferred list?
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 16:19       ` Ryan Roberts
@ 2024-03-06 17:41         ` Ryan Roberts
  2024-03-06 18:41           ` Zi Yan
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-06 17:41 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 06/03/2024 16:19, Ryan Roberts wrote:
> On 06/03/2024 16:09, Matthew Wilcox wrote:
>> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
>>> When running some swap tests with this change (which is in mm-stable)
>>> present, I see BadThings(TM). Usually I see a "bad page state"
>>> followed by a delay of a few seconds, followed by an oops or NULL
>>> pointer deref. Bisect points to this change, and if I revert it,
>>> the problem goes away.
>>
>> That oops is really messed up ;-(  We've clearly got two CPUs oopsing at
>> the same time and it's all interleaved.  That said, I can pick some
>> nuggets out of it.
>>
>>> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
>>> [   76.240196] kernel BUG at include/linux/mm.h:1120!
>>
>> These are the two different BUGs being called simultaneously ...
>>
>> The first one is bad_page() in page_alloc.c and the second is
>> put_page_testzero()
>>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>
>> I'm sure it's significant that both of these are the same page (pfn
>> 2554a0).  Feels like we have two CPUs calling folio_put() at the same
>> time, and one of them underflows.  It probably doesn't matter which call
>> trace ends up in bad_page() and which in put_page_testzero().
>>
>> One of them is coming from deferred_split_scan(), which is weird because
>> we can see the folio_try_get() earlier in the function.  So whatever
>> this folio was, we found it on the deferred split list, got its refcount,
>> moved it to the local list, either failed to get the lock, or
>> successfully got the lock, split it, unlocked it and put it.
>>
>> (I can see this was invoked from page fault -> memcg shrinking.  That's
>> probably irrelevant but explains some of the functions in the backtrace)
>>
>> The other call trace comes from migrate_folio_done() where we're putting
>> the _source_ folio.  That was called from migrate_pages_batch() which
>> was called from kcompactd.
>>
>> Um.  Where do we handle the deferred list in the migration code?
>>
>>
>> I've also tried looking at this from a different angle -- what is it
>> about this commit that produces this problem?  It's a fairly small
>> commit:
>>
>> -               if (folio_test_large(folio)) {
>> +               /* hugetlb has its own memcg */
>> +               if (folio_test_hugetlb(folio)) {
>>                         if (lruvec) {
>>                                 unlock_page_lruvec_irqrestore(lruvec, flags);
>>                                 lruvec = NULL;
>>                         }
>> -                       __folio_put_large(folio);
>> +                       free_huge_folio(folio);
>>
>> So all that's changed is that large non-hugetlb folios do not call
>> __folio_put_large().  As a reminder, that function does:
>>
>>         if (!folio_test_hugetlb(folio))
>>                 page_cache_release(folio);
>>         destroy_large_folio(folio);
>>
>> and destroy_large_folio() does:
>>         if (folio_test_large_rmappable(folio))
>>                 folio_undo_large_rmappable(folio);
>>
>>         mem_cgroup_uncharge(folio);
>>         free_the_page(&folio->page, folio_order(folio));
>>
>> So after my patch, instead of calling (in order):
>>
>> 	page_cache_release(folio);
>> 	folio_undo_large_rmappable(folio);
>> 	mem_cgroup_uncharge(folio);
>> 	free_unref_page()
>>
>> it calls:
>>
>> 	__page_cache_release(folio, &lruvec, &flags);
>> 	mem_cgroup_uncharge_folios()
>> 	folio_undo_large_rmappable(folio);
>>
>> So have I simply widened the window for this race 
> 
> Yes that's the conclusion I'm coming to. I have reverted this patch and am still
> seeing what looks like the same problem very occasionally. (I was just about to
> let you know when I saw this reply). It's much harder to reproduce now... great.
> 
> The original oops I reported against your RFC is here:
> https://lore.kernel.org/linux-mm/eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com/
> 
> Looks like I had UBSAN enabled for that run. Let me turn on all the bells and
> whistles and see if I can get it to repro more reliably to bisect.
> 
> Assuming the original oops and this are related, that implies that the problem
> is lurking somewhere in this series, if not this patch.
> 
> I'll come back to you shortly...

Just a bunch of circumstantial observations, I'm afraid. No conclusions yet...

With this patch reverted:

- Haven't triggered with any of the sanitizers compiled in
- Have only triggered when my code is on top (swap-out mTHP)
- Have only triggered when compiled using GCC 12.2 (can't trigger with 11.4)

So perhaps I'm looking at 2 different things, with this new intermittent problem
caused by my changes. Or perhaps my changes increase the window significantly.

I have to go pick up my daughter now. Can look at this some more tomorrow, but
struggling for ideas - need a way to more reliably reproduce.

> 
>> , whatever it is
>> exactly?  Something involving mis-handling of the deferred list?
>>
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 17:41         ` Ryan Roberts
@ 2024-03-06 18:41           ` Zi Yan
  2024-03-06 19:55             ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Zi Yan @ 2024-03-06 18:41 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, Andrew Morton, linux-mm, Yang Shi, Huang Ying


On 6 Mar 2024, at 12:41, Ryan Roberts wrote:

> On 06/03/2024 16:19, Ryan Roberts wrote:
>> On 06/03/2024 16:09, Matthew Wilcox wrote:
>>> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
>>>> When running some swap tests with this change (which is in mm-stable)
>>>> present, I see BadThings(TM). Usually I see a "bad page state"
>>>> followed by a delay of a few seconds, followed by an oops or NULL
>>>> pointer deref. Bisect points to this change, and if I revert it,
>>>> the problem goes away.
>>>
>>> That oops is really messed up ;-(  We've clearly got two CPUs oopsing at
>>> the same time and it's all interleaved.  That said, I can pick some
>>> nuggets out of it.
>>>
>>>> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
>>>> [   76.240196] kernel BUG at include/linux/mm.h:1120!
>>>
>>> These are the two different BUGs being called simultaneously ...
>>>
>>> The first one is bad_page() in page_alloc.c and the second is
>>> put_page_testzero()
>>>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>>
>>> I'm sure it's significant that both of these are the same page (pfn
>>> 2554a0).  Feels like we have two CPUs calling folio_put() at the same
>>> time, and one of them underflows.  It probably doesn't matter which call
>>> trace ends up in bad_page() and which in put_page_testzero().
>>>
>>> One of them is coming from deferred_split_scan(), which is weird because
>>> we can see the folio_try_get() earlier in the function.  So whatever
>>> this folio was, we found it on the deferred split list, got its refcount,
>>> moved it to the local list, either failed to get the lock, or
>>> successfully got the lock, split it, unlocked it and put it.
>>>
>>> (I can see this was invoked from page fault -> memcg shrinking.  That's
>>> probably irrelevant but explains some of the functions in the backtrace)
>>>
>>> The other call trace comes from migrate_folio_done() where we're putting
>>> the _source_ folio.  That was called from migrate_pages_batch() which
>>> was called from kcompactd.
>>>
>>> Um.  Where do we handle the deferred list in the migration code?
>>>
>>>
>>> I've also tried looking at this from a different angle -- what is it
>>> about this commit that produces this problem?  It's a fairly small
>>> commit:
>>>
>>> -               if (folio_test_large(folio)) {
>>> +               /* hugetlb has its own memcg */
>>> +               if (folio_test_hugetlb(folio)) {
>>>                         if (lruvec) {
>>>                                 unlock_page_lruvec_irqrestore(lruvec, flags);
>>>                                 lruvec = NULL;
>>>                         }
>>> -                       __folio_put_large(folio);
>>> +                       free_huge_folio(folio);
>>>
>>> So all that's changed is that large non-hugetlb folios do not call
>>> __folio_put_large().  As a reminder, that function does:
>>>
>>>         if (!folio_test_hugetlb(folio))
>>>                 page_cache_release(folio);
>>>         destroy_large_folio(folio);
>>>
>>> and destroy_large_folio() does:
>>>         if (folio_test_large_rmappable(folio))
>>>                 folio_undo_large_rmappable(folio);
>>>
>>>         mem_cgroup_uncharge(folio);
>>>         free_the_page(&folio->page, folio_order(folio));
>>>
>>> So after my patch, instead of calling (in order):
>>>
>>> 	page_cache_release(folio);
>>> 	folio_undo_large_rmappable(folio);
>>> 	mem_cgroup_uncharge(folio);
>>> 	free_unref_page()
>>>
>>> it calls:
>>>
>>> 	__page_cache_release(folio, &lruvec, &flags);
>>> 	mem_cgroup_uncharge_folios()
>>> 	folio_undo_large_rmappable(folio);
>>>
>>> So have I simply widened the window for this race
>>
>> Yes that's the conclusion I'm coming to. I have reverted this patch and am still
>> seeing what looks like the same problem very occasionally. (I was just about to
>> let you know when I saw this reply). It's much harder to reproduce now... great.
>>
>> The original oops I reported against your RFC is here:
>> https://lore.kernel.org/linux-mm/eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com/
>>
>> Looks like I had UBSAN enabled for that run. Let me turn on all the bells and
>> whistles and see if I can get it to repro more reliably to bisect.
>>
>> Assuming the original oops and this are related, that implies that the problem
>> is lurking somewhere in this series, if not this patch.
>>
>> I'll come back to you shortly...
>
> Just a bunch of circumstantial observations, I'm afraid. No conclusions yet...
>
> With this patch reverted:
>
> - Haven't triggered with any of the sanitizers compiled in
> - Have only triggered when my code is on top (swap-out mTHP)
> - Have only triggered when compiled using GCC 12.2 (can't trigger with 11.4)
>
> So perhaps I'm looking at 2 different things, with this new intermittent problem
> caused by my changes. Or perhaps my changes increase the window significantly.
>
> I have to go pick up my daughter now. Can look at this some more tomorrow, but
> struggling for ideas - need a way to more reliably reproduce.
>
>>
>>> , whatever it is
>>> exactly?  Something involving mis-handling of the deferred list?

I had a chat with willy on the deferred list mis-handling. Current migration
code (starting from commit 616b8371539a6 ("mm: thp: enable thp migration in
generic path")) does not properly handle THP and mTHP on the deferred list.
So if the source folio is on the deferred list, the destination folio will
not be on it after migration. But this seems to be a benign bug, since all
that is lost is the opportunity to split a partially mapped THP/mTHP.

In terms of potential races, the source folio's refcount is elevated before
migration, so deferred_split_scan() can move the folio off the deferred_list
but cannot split it. During folio_migrate_mapping(), when the folio is
frozen, deferred_split_scan() cannot move the folio off the deferred_list to
begin with.

I am going to send a patch to fix the deferred_list handling in migration,
but it does not seem to be related to the bug in this email thread.

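Purely as an illustration of what "fix the deferred_list handling in
migration" could look like; this is a guess, not the actual patch. It
assumes src/dst are the source and destination folios, that both map to the
same split queue, and that the code sits in mm/huge_memory.c where
get_deferred_split_queue() is visible:

	if (folio_test_large_rmappable(src) &&
	    !list_empty(&src->_deferred_list)) {
		struct deferred_split *ds_queue = get_deferred_split_queue(src);
		unsigned long flags;

		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		/* re-check under the lock, then hand the "split me later"
		 * hint over to the destination folio; the queue length is
		 * unchanged (one entry out, one entry in) */
		if (!list_empty(&src->_deferred_list)) {
			list_del_init(&src->_deferred_list);
			list_add_tail(&dst->_deferred_list,
				      &ds_queue->split_queue);
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}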

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 18:41           ` Zi Yan
@ 2024-03-06 19:55             ` Matthew Wilcox
  2024-03-06 21:55               ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-06 19:55 UTC (permalink / raw)
  To: Zi Yan; +Cc: Ryan Roberts, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Wed, Mar 06, 2024 at 01:41:13PM -0500, Zi Yan wrote:
> I had a chat with willy on the deferred list mis-handling. Current migration
> code (starting from commit 616b8371539a6 ("mm: thp: enable thp migration in
> generic path")) does not properly handle THP and mTHP on the deferred list.
> So if the source folio is on the deferred list, after migration,
> the destination folio will not. But this seems a benign bug, since
> the opportunity of splitting a partially mapped THP/mTHP is gone.
> 
> In terms of potential races, the source folio refcount is elevated before
> migration, deferred_split_scan() can move the folio off the deferred_list,
> but cannot split it. During folio_migrate_mapping() when folio is frozen,
> deferred_split_scan() cannot move the folio off the deferred_list to begin
> with.
> 
> I am going to send a patch to fix the deferred_list handling in migration,
> but it seems not be related to the bug in this email thread.

... IOW the source folio remains on the deferred list until its
refcount goes to 0, at which point we call folio_undo_large_rmappable()
and remove it from the deferred list.

A different line of enquiry might be the "else /* We lost race with
folio_put() */" in deferred_split_scan().  If somebody froze the
refcount, we can lose track of a deferred-split folio.  But I think
that's OK too.  The only places which freeze a folio are vmscan (about
to free), folio_migrate_mapping() (discussed above), and page splitting.
In none of these cases do we want to keep the folio on the deferred
split list because we're either freeing it, migrating it or splitting
it.

Oh, and there's something in s390 that I can't be bothered to look at.

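(Aside, to make "frozen" concrete: folio_ref_freeze() atomically switches
the refcount from the expected value to zero, so any concurrent
folio_try_get() fails until folio_ref_unfreeze() runs.  Roughly what
folio_migrate_mapping() does; paraphrased, so treat it as a sketch:)

	xas_lock_irq(&xas);
	if (!folio_ref_freeze(folio, expected_count)) {
		/* somebody else holds a reference; back off and retry */
		xas_unlock_irq(&xas);
		return -EAGAIN;
	}
	/* refcount is 0 here: folio_try_get() elsewhere fails until
	 * folio_ref_unfreeze() runs once the mapping has been moved */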

Hang on, I think I see it.  It is a race between folio freeing and
deferred_split_scan(), but page migration is absolved.  Look:

CPU 1: deferred_split_scan:
spin_lock_irqsave(split_queue_lock)
list_for_each_entry_safe()
folio_try_get()
list_move(&folio->_deferred_list, &list);
spin_unlock_irqrestore(split_queue_lock)
list_for_each_entry_safe() {
	folio_trylock() <- fails
	folio_put(folio);

CPU 2: folio_put:
folio_undo_large_rmappable
        ds_queue = get_deferred_split_queue(folio);
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                list_del_init(&folio->_deferred_list);
*** at this point CPU 1 is not holding the split_queue_lock; the
folio is on the local list.  Which we just corrupted ***

Now anything can happen.  It's a pretty tight race that involves at
least two CPUs (CPU 2 might have been the one to have the folio locked
at the time CPU 1 called folio_trylock()).  But I definitely widened
the window by moving the decrement of the refcount and the removal from
the deferred list further apart.

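To make the failure mode concrete, here is a deliberately oversimplified,
single-threaded userspace illustration of the general hazard: an unlocked
list_del_init() against a list that somebody else is still walking.  It is
not a faithful replay of the kernel interleaving above; the toy list helpers
stand in for <linux/list.h> and "toy_folio" is invented for illustration.
Build with "cc -g -fsanitize=address sketch.c" and ASan reports the
use-after-free when "CPU 1" follows its cached pointer into the node that
"CPU 2" unlinked and freed.

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	prev->next = new;
	new->prev = prev;
	new->next = head;
	head->prev = new;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_folio {
	int pfn;
	struct list_head _deferred_list;
};

int main(void)
{
	struct list_head local;			/* "CPU 1"'s on-stack list */
	struct toy_folio *f[3];
	struct list_head *pos, *next;

	INIT_LIST_HEAD(&local);
	for (int i = 0; i < 3; i++) {
		f[i] = malloc(sizeof(*f[i]));
		f[i]->pfn = 1000 + i;
		list_add_tail(&f[i]->_deferred_list, &local);
	}

	/* "CPU 1" starts walking its local list and caches a pointer to
	 * the next node, as list_for_each_entry_safe() would. */
	pos = local.next;			/* f[0] */
	next = pos->next;			/* f[1] */

	/* "CPU 2" unlinks and frees that node.  It holds a lock of its
	 * own, but not one that "CPU 1"'s walk ever takes, so nothing
	 * stops this from racing with the walk. */
	list_del_init(&f[1]->_deferred_list);
	free(f[1]);

	/* "CPU 1" resumes and follows its cached pointer: use-after-free. */
	pos = next;
	printf("CPU 1 reads a freed node: pfn=%d\n",
	       container_of(pos, struct toy_folio, _deferred_list)->pfn);
	return 0;
}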

OK, so what's the solution here?  Personally I favour using a
folio_batch in deferred_split_scan() to hold the folios that we're
going to try to remove instead of a linked list.  Other ideas that are
perhaps less intrusive?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 19:55             ` Matthew Wilcox
@ 2024-03-06 21:55               ` Matthew Wilcox
  2024-03-07  8:56                 ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-06 21:55 UTC (permalink / raw)
  To: Zi Yan; +Cc: Ryan Roberts, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
> Hang on, I think I see it.  It is a race between folio freeing and
> deferred_split_scan(), but page migration is absolved.  Look:
> 
> CPU 1: deferred_split_scan:
> spin_lock_irqsave(split_queue_lock)
> list_for_each_entry_safe()
> folio_try_get()
> list_move(&folio->_deferred_list, &list);
> spin_unlock_irqrestore(split_queue_lock)
> list_for_each_entry_safe() {
> 	folio_trylock() <- fails
> 	folio_put(folio);
> 
> CPU 2: folio_put:
> folio_undo_large_rmappable
>         ds_queue = get_deferred_split_queue(folio);
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>                 list_del_init(&folio->_deferred_list);
> *** at this point CPU 1 is not holding the split_queue_lock; the
> folio is on the local list.  Which we just corrupted ***
> 
> Now anything can happen.  It's a pretty tight race that involves at
> least two CPUs (CPU 2 might have been the one to have the folio locked
> at the time CPU 1 called folio_trylock()).  But I definitely widened
> the window by moving the decrement of the refcount and the removal from
> the deferred list further apart.
> 
> 
> OK, so what's the solution here?  Personally I favour using a
> folio_batch in deferred_split_scan() to hold the folios that we're
> going to try to remove instead of a linked list.  Other ideas that are
> perhaps less intrusive?

I looked at a few options, but I think we need to keep the refcount
elevated until we've got the folios back on the deferred split list.
And we can't call folio_put() while holding the split_queue_lock or
we'll deadlock.  So we need to maintain a list of folios that isn't
linked through deferred_list.  Anyway, this is basically untested,
except that it compiles.

Opinions?  Better patches?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..0120a47ea7a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+		if (!folio_try_get(folio))
+			continue;
+		if (!folio_trylock(folio))
+			continue;
+		list_del_init(&folio->_deferred_list);
+		if (folio_batch_add(&batch, folio) == 0) {
+			--sc->nr_to_scan;
+			break;
 		}
 		if (!--sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
-		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
+	folios_put(&batch);
+
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
 	 * This can happen if pages were freed under us.


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 21:55               ` Matthew Wilcox
@ 2024-03-07  8:56                 ` Ryan Roberts
  2024-03-07 13:50                   ` Yin, Fengwei
  2024-03-07 17:33                   ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox
  0 siblings, 2 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-07  8:56 UTC (permalink / raw)
  To: Matthew Wilcox, Zi Yan; +Cc: Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 06/03/2024 21:55, Matthew Wilcox wrote:
> On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
>> Hang on, I think I see it.  It is a race between folio freeing and
>> deferred_split_scan(), but page migration is absolved.  Look:
>>
>> CPU 1: deferred_split_scan:
>> spin_lock_irqsave(split_queue_lock)
>> list_for_each_entry_safe()
>> folio_try_get()
>> list_move(&folio->_deferred_list, &list);
>> spin_unlock_irqrestore(split_queue_lock)
>> list_for_each_entry_safe() {
>> 	folio_trylock() <- fails
>> 	folio_put(folio);
>>
>> CPU 2: folio_put:
>> folio_undo_large_rmappable
>>         ds_queue = get_deferred_split_queue(folio);
>>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>                 list_del_init(&folio->_deferred_list);
>> *** at this point CPU 1 is not holding the split_queue_lock; the
>> folio is on the local list.  Which we just corrupted ***

Wow, this would have taken me weeks...

I just want to make sure I've understood correctly: CPU1's folio_put() is not the last reference, and it keeps iterating through the local list. Then CPU2 does the final folio_put() which causes list_del_init() to modify the local list concurrently with CPU1's iteration, so CPU1 probably goes into the weeds?

>>
>> Now anything can happen.  It's a pretty tight race that involves at
>> least two CPUs (CPU 2 might have been the one to have the folio locked
>> at the time CPU 1 called folio_trylock()).  But I definitely widened
>> the window by moving the decrement of the refcount and the removal from
>> the deferred list further apart.
>>
>>
>> OK, so what's the solution here?  Personally I favour using a
>> folio_batch in deferred_split_scan() to hold the folios that we're
>> going to try to remove instead of a linked list.  Other ideas that are
>> perhaps less intrusive?
> 
> I looked at a few options, but I think we need to keep the refcount
> elevated until we've got the folios back on the deferred split list.
> And we can't call folio_put() while holding the split_queue_lock or
> we'll deadlock.  So we need to maintain a list of folios that isn't
> linked through deferred_list.  Anyway, this is basically untested,
> except that it compiles.

If we can't call folio_put() under spinlock, then I agree.

> 
> Opinions?  Better patches?

I assume the fact that one scan is limited to freeing a batch-worth of folios is not a problem? The shrinker will keep calling while there are folios on the deferred list?

> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd745bcc97ff..0120a47ea7a1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct folio_batch batch;
>  	struct folio *folio, *next;
>  	int split = 0;
>  
> @@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>  
> +	folio_batch_init(&batch);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	/* Take pin on all head pages to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
> -		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> -			/* We lost race with folio_put() */
> -			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +		if (!folio_try_get(folio))
> +			continue;
> +		if (!folio_trylock(folio))
> +			continue;
> +		list_del_init(&folio->_deferred_list);
> +		if (folio_batch_add(&batch, folio) == 0) {
> +			--sc->nr_to_scan;
> +			break;
>  		}
>  		if (!--sc->nr_to_scan)
>  			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> -		if (!folio_trylock(folio))
> -			goto next;
> -		/* split_huge_page() removes page from list on success */
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>  		if (!split_folio(folio))
>  			split++;
>  		folio_unlock(folio);
> -next:
> -		folio_put(folio);
>  	}
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
> +		if (!folio_test_large(folio))
> +			continue;
> +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> +	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> +	folios_put(&batch);
> +
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>  	 * This can happen if pages were freed under us.

I've added this patch to my branch and tested (still without the patch that I fingered as the culprit originally, for now). Unfortunately it is still blowing up at about the same rate, although it looks very different now. I've seen bad things twice. The first time was RCU stalls, but systemd had turned the log level down so no stack trace and I didn't manage to get any further information. The second time, this:

[  338.519401] Unable to handle kernel paging request at virtual address fffc001b13a8c870
[  338.519402] Unable to handle kernel paging request at virtual address fffc001b13a8c870
[  338.519407] Mem abort info:
[  338.519407]   ESR = 0x0000000096000004
[  338.519408]   EC = 0x25: DABT (current EL), IL = 32 bits
[  338.519588] Unable to handle kernel paging request at virtual address fffc001b13a8c870
[  338.519591] Mem abort info:
[  338.519592]   ESR = 0x0000000096000004
[  338.519593]   EC = 0x25: DABT (current EL), IL = 32 bits
[  338.519594]   SET = 0, FnV = 0
[  338.519595]   EA = 0, S1PTW = 0
[  338.519596]   FSC = 0x04: level 0 translation fault
[  338.519597] Data abort info:
[  338.519597]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  338.519598]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  338.519599]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  338.519600] [fffc001b13a8c870] address between user and kernel address ranges
[  338.519602] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  338.519605] Modules linked in:
[  338.519607] CPU: 43 PID: 3234 Comm: usemem Not tainted 6.8.0-rc5-00465-g279cb41b481e-dirty #3
[  338.519610] Hardware name: linux,dummy-virt (DT)
[  338.519611] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  338.519613] pc : down_read_trylock+0x2c/0xd0
[  338.519618] lr : folio_lock_anon_vma_read+0x74/0x2c8
[  338.519623] sp : ffff800087f935c0
[  338.519623] x29: ffff800087f935c0 x28: 0000000000000000 x27: ffff800087f937e0
[  338.519626] x26: 0000000000000001 x25: ffff800087f937a8 x24: fffffc0007258180
[  338.519628] x23: ffff800087f936c8 x22: fffc001b13a8c870 x21: ffff0000f7d51d69
[  338.519630] x20: ffff0000f7d51d68 x19: fffffc0007258180 x18: 0000000000000000
[  338.519632] x17: 0000000000000001 x16: ffff0000c90ab458 x15: 0000000000000040
[  338.519634] x14: ffff0000c8c7b558 x13: 0000000000000228 x12: 000040f22f534640
[  338.519637] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080338b3c
[  338.519639] x8 : ffff800087f93618 x7 : 0000000000000000 x6 : ffff0000c9692f50
[  338.519641] x5 : ffff800087f936b0 x4 : 0000000000000001 x3 : ffff0000d70d9140
[  338.519643] x2 : 0000000000000001 x1 : fffc001b13a8c870 x0 : fffc001b13a8c870
[  338.519645] Call trace:
[  338.519646]  down_read_trylock+0x2c/0xd0
[  338.519648]  folio_lock_anon_vma_read+0x74/0x2c8
[  338.519650]  rmap_walk_anon+0x1d8/0x2c0
[  338.519652]  folio_referenced+0x1b4/0x1e0
[  338.519655]  shrink_folio_list+0x768/0x10c8
[  338.519658]  shrink_lruvec+0x5dc/0xb30
[  338.519660]  shrink_node+0x4d8/0x8b0
[  338.519662]  do_try_to_free_pages+0xe0/0x5a8
[  338.519665]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[  338.519667]  try_charge_memcg+0x114/0x658
[  338.519671]  __mem_cgroup_charge+0x6c/0xd0
[  338.519672]  __handle_mm_fault+0x42c/0x1640
[  338.519675]  handle_mm_fault+0x70/0x290
[  338.519677]  do_page_fault+0xfc/0x4d8
[  338.519681]  do_translation_fault+0xa4/0xc0
[  338.519682]  do_mem_abort+0x4c/0xa8
[  338.519685]  el0_da+0x2c/0x78
[  338.519687]  el0t_64_sync_handler+0xb8/0x130
[  338.519689]  el0t_64_sync+0x190/0x198
[  338.519692] Code: aa0003e1 b9400862 11000442 b9000862 (f9400000) 
[  338.519693] ---[ end trace 0000000000000000 ]---

The fault is when trying to do an atomic_long_read(&sem->count) here:

struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
					  struct rmap_walk_control *rwc)
{
	struct anon_vma *anon_vma = NULL;
	struct anon_vma *root_anon_vma;
	unsigned long anon_mapping;

retry:
	rcu_read_lock();
	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	if (!folio_mapped(folio))
		goto out;

	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	root_anon_vma = READ_ONCE(anon_vma->root);
	if (down_read_trylock(&root_anon_vma->rwsem)) { <<<<<<<

I guess we are still corrupting folios?




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re:
  2024-03-07  8:56                 ` Ryan Roberts
@ 2024-03-07 13:50                   ` Yin, Fengwei
  2024-03-07 14:05                     ` Re: Matthew Wilcox
  2024-03-07 17:33                   ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox
  1 sibling, 1 reply; 73+ messages in thread
From: Yin, Fengwei @ 2024-03-07 13:50 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, Zi Yan
  Cc: Andrew Morton, linux-mm, Yang Shi, Huang Ying



On 3/7/2024 4:56 PM, Ryan Roberts wrote:
> I just want to make sure I've understood correctly: CPU1's folio_put()
> is not the last reference, and it keeps iterating through the local
> list. Then CPU2 does the final folio_put() which causes list_del_init()
> to modify the local list concurrently with CPU1's iteration, so CPU1
> probably goes into the weeds?

My understanding is that this cannot corrupt folio->_deferred_list, as
this folio has already been iterated past.


But I did see another strange thing:
[   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0xffffbd0a0 pfn:0x2554a0
[   76.270483] note: kcompactd0[62] exited with preempt_count 1
[   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0

This large folio has order 0? Maybe folio->_flags_1 was screwed?

In free_unref_folios(), there is code like the following:
                 if (order > 0 && folio_test_large_rmappable(folio))
                         folio_undo_large_rmappable(folio);

But with destroy_large_folio():
         if (folio_test_large_rmappable(folio))
                 folio_undo_large_rmappable(folio);

Could this be connected to the folio with zero refcount that is still on the
deferred list, seen with Matthew's patch?


Looks like the folio order was cleared unexpectedly somewhere.

Regards
Yin, Fengwei



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re:
  2024-03-07 13:50                   ` Yin, Fengwei
@ 2024-03-07 14:05                     ` Matthew Wilcox
  2024-03-07 15:24                       ` Re: Ryan Roberts
  2024-03-08  1:06                       ` Re: Yin, Fengwei
  0 siblings, 2 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-07 14:05 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
> 
> 
> On 3/7/2024 4:56 PM,  wrote:
> > I just want to make sure I've understood correctly: CPU1's folio_put()
> > is not the last reference, and it keeps iterating through the local
> > list. Then CPU2 does the final folio_put() which causes list_del_init()
> > to modify the local list concurrently with CPU1's iteration, so CPU1
> > probably goes into the weeds?
> 
> My understanding is this can not corrupt the folio->deferred_list as
> this folio was iterated already.

I am not convinced about that at all.  It's possible this isn't the only
problem, but deleting something from a list without holding (the correct)
lock is something you have to think incredibly hard about to get right.
I didn't bother going any deeper into the analysis once I spotted the
locking problem, but the burden of proof is very much on you to show that
this is not a bug!

> But I did see other strange thing:
> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
> index:0xffffbd0a0 pfn:0x2554a0
> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
> 
> This large folio has order 0? Maybe folio->_flags_1 was screwed?
> 
> In free_unref_folios(), there is code like following:
>                 if (order > 0 && folio_test_large_rmappable(folio))
>                         folio_undo_large_rmappable(folio);
> 
> But with destroy_large_folio():
>         if (folio_test_large_rmappable(folio))
> 
> 			folio_undo_large_rmappable(folio);
> 
> Can it connect to the folio has zero refcount still in deferred list
> with Matthew's patch?
> 
> 
> Looks like folio order was cleared unexpected somewhere.

No, we intentionally clear it:

free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
page[1].flags &= ~PAGE_FLAGS_SECOND;

PAGE_FLAGS_SECOND includes the order, which is why we have to save it
away in folio->private so that we know what it is in the second loop.
So it's always been cleared by the time we call free_page_is_bad().
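
If it helps to see the shape of it, here is a rough sketch of the two loops
in free_unref_folios() -- simplified from memory rather than the exact
mm/page_alloc.c source, with free_the_folio() standing in as a placeholder
for the real freeing path:

	void free_unref_folios_sketch(struct folio_batch *folios)
	{
		unsigned int i;

		for (i = 0; i < folios->nr; i++) {
			struct folio *folio = folios->folios[i];
			unsigned long pfn = folio_pfn(folio);
			unsigned int order = folio_order(folio);

			/* Clears PAGE_FLAGS_SECOND, and the order with it */
			if (!free_unref_page_prepare(&folio->page, pfn, order))
				continue;
			/* Stash the order before it is lost */
			folio->private = (void *)(unsigned long)order;
		}

		for (i = 0; i < folios->nr; i++) {
			struct folio *folio = folios->folios[i];
			/* folio_order() now reads 0; use the saved copy */
			unsigned int order = (unsigned long)folio->private;

			free_the_folio(folio, order);	/* placeholder */
		}
	}

So a large folio showing order:0 in a bad-page dump after this point is
expected rather than evidence of corruption.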


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re:
  2024-03-07 14:05                     ` Re: Matthew Wilcox
@ 2024-03-07 15:24                       ` Ryan Roberts
  2024-03-07 16:24                         ` Re: Ryan Roberts
  2024-03-08  1:06                       ` Re: Yin, Fengwei
  1 sibling, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-07 15:24 UTC (permalink / raw)
  To: Matthew Wilcox, Yin, Fengwei
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 14:05, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/7/2024 4:56 PM,  wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is this can not corrupt the folio->deferred_list as
>> this folio was iterated already.
> 
> I am not convinced about that at all.  It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the proof is very much on you that this is not a bug!
> 
>> But I did see other strange thing:
>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>> index:0xffffbd0a0 pfn:0x2554a0
>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>
>> In free_unref_folios(), there is code like following:
>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>                         folio_undo_large_rmappable(folio);
>>
>> But with destroy_large_folio():
>>         if (folio_test_large_rmappable(folio))
>>
>> 			folio_undo_large_rmappable(folio);
>>
>> Can it connect to the folio has zero refcount still in deferred list
>> with Matthew's patch?
>>
>>
>> Looks like folio order was cleared unexpected somewhere.

I think there could be something to this...

I have a setup where, when running with Matthew's deferred split fix AND with
commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
everything works as expected. And at the end, I have the expected amount of
memory free (seen in meminfo and buddyinfo).

But if I run only with the deferred split fix and DO NOT revert the other
change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
stalls where I can't even interact on the serial port. Sometimes (more usually)
everything just gets stuck trying to reclaim and allocate memory. And when I
kill the jobs, I still have barely any memory in the system - about 10% of
what I would expect.

So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
free_unref_folios()", when freeing a 2M folio back to the buddy, we are actually
only telling it about the first 4K page, so we end up leaking the rest?

> 
> No, we intentionally clear it:
> 
> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
> page[1].flags &= ~PAGE_FLAGS_SECOND;
> 
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re:
  2024-03-07 15:24                       ` Re: Ryan Roberts
@ 2024-03-07 16:24                         ` Ryan Roberts
  2024-03-07 23:02                           ` Re: Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-07 16:24 UTC (permalink / raw)
  To: Matthew Wilcox, Yin, Fengwei
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 15:24, Ryan Roberts wrote:
> On 07/03/2024 14:05, Matthew Wilcox wrote:
>> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/7/2024 4:56 PM,  wrote:
>>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>>> is not the last reference, and it keeps iterating through the local
>>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>>> probably goes into the weeds?
>>>
>>> My understanding is this can not corrupt the folio->deferred_list as
>>> this folio was iterated already.
>>
>> I am not convinced about that at all.  It's possible this isn't the only
>> problem, but deleting something from a list without holding (the correct)
>> lock is something you have to think incredibly hard about to get right.
>> I didn't bother going any deeper into the analysis once I spotted the
>> locking problem, but the proof is very much on you that this is not a bug!
>>
>>> But I did see other strange thing:
>>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>>> index:0xffffbd0a0 pfn:0x2554a0
>>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>>
>>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>>
>>> In free_unref_folios(), there is code like following:
>>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>>                         folio_undo_large_rmappable(folio);
>>>
>>> But with destroy_large_folio():
>>>         if (folio_test_large_rmappable(folio))
>>>
>>> 			folio_undo_large_rmappable(folio);
>>>
>>> Can it connect to the folio has zero refcount still in deferred list
>>> with Matthew's patch?
>>>
>>>
>>> Looks like folio order was cleared unexpected somewhere.
> 
> I think there could be something to this...
> 
> I have a set up where, when running with Matthew's deferred split fix AND have
> commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
> everything works as expected. And at the end, I have the expected amount of
> memory free (seen in meminfo and buddyinfo).
> 
> But if I run only with the deferred split fix and DO NOT revert the other
> change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> stalls where I can't even interact on the serial port. Sometimes (more usually)
> everything just gets stuck trying to reclaim and allocate memory. And when I
> kill the jobs, I still have barely any memory in the system - about 10% what I
> would expect.
> 
> So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
> free_unref_folios()", when freeing 2M folio back to the buddy, we are actually
> only telling it about the first 4K page? So we end up leaking the rest?

I notice that before the commit, large folios are uncharged with
__mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios().

The former has an upfront check:

	if (!folio_memcg(folio))
		return;

I'm not exactly sure what that's checking, but could the fact that this check
is missing after the change cause things to go wonky?


> 
>>
>> No, we intentionally clear it:
>>
>> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
>> page[1].flags &= ~PAGE_FLAGS_SECOND;
>>
>> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
>> away in folio->private so that we know what it is in the second loop.
>> So it's always been cleared by the time we call free_page_is_bad().
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-07  8:56                 ` Ryan Roberts
  2024-03-07 13:50                   ` Yin, Fengwei
@ 2024-03-07 17:33                   ` Matthew Wilcox
  2024-03-07 18:35                     ` Ryan Roberts
  2024-03-08 11:44                     ` Ryan Roberts
  1 sibling, 2 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-07 17:33 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 08:56:27AM +0000, Ryan Roberts wrote:
> On 06/03/2024 21:55, Matthew Wilcox wrote:
> > On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
> >> Hang on, I think I see it.  It is a race between folio freeing and
> >> deferred_split_scan(), but page migration is absolved.  Look:
> >>
> >> CPU 1: deferred_split_scan:
> >> spin_lock_irqsave(split_queue_lock)
> >> list_for_each_entry_safe()
> >> folio_try_get()
> >> list_move(&folio->_deferred_list, &list);
> >> spin_unlock_irqrestore(split_queue_lock)
> >> list_for_each_entry_safe() {
> >> 	folio_trylock() <- fails
> >> 	folio_put(folio);
> >>
> >> CPU 2: folio_put:
> >> folio_undo_large_rmappable
> >>         ds_queue = get_deferred_split_queue(folio);
> >>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>                 list_del_init(&folio->_deferred_list);
> >> *** at this point CPU 1 is not holding the split_queue_lock; the
> >> folio is on the local list.  Which we just corrupted ***
> 
> Wow, this would have taken me weeks...

It certainly took me hours of staring at the code ...

> I just want to make sure I've understood correctly: CPU1's folio_put()
> is not the last reference, and it keeps iterating through the local
> list. Then CPU2 does the final folio_put() which causes list_del_init()
> to modify the local list concurrently with CPU1's iteration, so CPU1
> probably goes into the weeds?

That is my suggestion for what the problem is, yes.

> > I looked at a few options, but I think we need to keep the refcount
> > elevated until we've got the folios back on the deferred split list.
> > And we can't call folio_put() while holding the split_queue_lock or
> > we'll deadlock.  So we need to maintain a list of folios that isn't
> > linked through deferred_list.  Anyway, this is basically untested,
> > except that it compiles.
> 
> If we can't call folio_put() under spinlock, then I agree.
> 
> > 
> > Opinions?  Better patches?
> 
> I assume the fact that one scan is limited to freeing a batch-worth of folios is not a problem? The shrinker will keep calling while there are folios on the deferred list?

I don't think it's a problem.  There's no particular requirement as to
how much work a shrinker does, just that it tries to make some progress
(afaik).

> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index fd745bcc97ff..0120a47ea7a1 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> >  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
> >  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
> >  	unsigned long flags;
> > -	LIST_HEAD(list);
> > +	struct folio_batch batch;
> >  	struct folio *folio, *next;
> >  	int split = 0;
> >  
> > @@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> >  		ds_queue = &sc->memcg->deferred_split_queue;
> >  #endif
> >  
> > +	folio_batch_init(&batch);
> >  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >  	/* Take pin on all head pages to avoid freeing them under us */
> >  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
> >  							_deferred_list) {
> > -		if (folio_try_get(folio)) {
> > -			list_move(&folio->_deferred_list, &list);
> > -		} else {
> > -			/* We lost race with folio_put() */
> > -			list_del_init(&folio->_deferred_list);
> > -			ds_queue->split_queue_len--;
> > +		if (!folio_try_get(folio))
> > +			continue;
> > +		if (!folio_trylock(folio))
> > +			continue;
> > +		list_del_init(&folio->_deferred_list);
> > +		if (folio_batch_add(&batch, folio) == 0) {
> > +			--sc->nr_to_scan;
> > +			break;
> >  		}
> >  		if (!--sc->nr_to_scan)
> >  			break;
> >  	}
> >  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >  
> > -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> > -		if (!folio_trylock(folio))
> > -			goto next;
> > -		/* split_huge_page() removes page from list on success */
> > +	while ((folio = folio_batch_next(&batch)) != NULL) {
> >  		if (!split_folio(folio))
> >  			split++;
> >  		folio_unlock(folio);
> > -next:
> > -		folio_put(folio);
> >  	}
> >  
> >  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > -	list_splice_tail(&list, &ds_queue->split_queue);
> > +	while ((folio = folio_batch_next(&batch)) != NULL) {
> > +		if (!folio_test_large(folio))
> > +			continue;
> > +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> > +	}
> >  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >  
> > +	folios_put(&batch);
> > +
> >  	/*
> >  	 * Stop shrinker if we didn't split any page, but the queue is empty.
> >  	 * This can happen if pages were freed under us.
> 
> I've added this patch to my branch and tested (still without the patch that I fingered as the culprit originally, for now). Unfortunately it is still blowing up at about the same rate, although it looks very different now. I've seen bad things twice. The first time was RCU stalls, but systemd had turned the log level down so no stack trace and I didn't manage to get any further information. The second time, this:
> 
> [  338.519401] Unable to handle kernel paging request at virtual address fffc001b13a8c870
> [  338.519402] Unable to handle kernel paging request at virtual address fffc001b13a8c870
> [  338.519407] Mem abort info:
> [  338.519407]   ESR = 0x0000000096000004
> [  338.519408]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  338.519588] Unable to handle kernel paging request at virtual address fffc001b13a8c870
> [  338.519591] Mem abort info:
> [  338.519592]   ESR = 0x0000000096000004
> [  338.519593]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  338.519594]   SET = 0, FnV = 0
> [  338.519595]   EA = 0, S1PTW = 0
> [  338.519596]   FSC = 0x04: level 0 translation fault
> [  338.519597] Data abort info:
> [  338.519597]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [  338.519598]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [  338.519599]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [  338.519600] [fffc001b13a8c870] address between user and kernel address ranges
> [  338.519602] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> [  338.519605] Modules linked in:
> [  338.519607] CPU: 43 PID: 3234 Comm: usemem Not tainted 6.8.0-rc5-00465-g279cb41b481e-dirty #3
> [  338.519610] Hardware name: linux,dummy-virt (DT)
> [  338.519611] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  338.519613] pc : down_read_trylock+0x2c/0xd0
> [  338.519618] lr : folio_lock_anon_vma_read+0x74/0x2c8
> [  338.519623] sp : ffff800087f935c0
> [  338.519623] x29: ffff800087f935c0 x28: 0000000000000000 x27: ffff800087f937e0
> [  338.519626] x26: 0000000000000001 x25: ffff800087f937a8 x24: fffffc0007258180
> [  338.519628] x23: ffff800087f936c8 x22: fffc001b13a8c870 x21: ffff0000f7d51d69
> [  338.519630] x20: ffff0000f7d51d68 x19: fffffc0007258180 x18: 0000000000000000
> [  338.519632] x17: 0000000000000001 x16: ffff0000c90ab458 x15: 0000000000000040
> [  338.519634] x14: ffff0000c8c7b558 x13: 0000000000000228 x12: 000040f22f534640
> [  338.519637] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080338b3c
> [  338.519639] x8 : ffff800087f93618 x7 : 0000000000000000 x6 : ffff0000c9692f50
> [  338.519641] x5 : ffff800087f936b0 x4 : 0000000000000001 x3 : ffff0000d70d9140
> [  338.519643] x2 : 0000000000000001 x1 : fffc001b13a8c870 x0 : fffc001b13a8c870
> [  338.519645] Call trace:
> [  338.519646]  down_read_trylock+0x2c/0xd0
> [  338.519648]  folio_lock_anon_vma_read+0x74/0x2c8
> [  338.519650]  rmap_walk_anon+0x1d8/0x2c0
> [  338.519652]  folio_referenced+0x1b4/0x1e0
> [  338.519655]  shrink_folio_list+0x768/0x10c8
> [  338.519658]  shrink_lruvec+0x5dc/0xb30
> [  338.519660]  shrink_node+0x4d8/0x8b0
> [  338.519662]  do_try_to_free_pages+0xe0/0x5a8
> [  338.519665]  try_to_free_mem_cgroup_pages+0x128/0x2d0
> [  338.519667]  try_charge_memcg+0x114/0x658
> [  338.519671]  __mem_cgroup_charge+0x6c/0xd0
> [  338.519672]  __handle_mm_fault+0x42c/0x1640
> [  338.519675]  handle_mm_fault+0x70/0x290
> [  338.519677]  do_page_fault+0xfc/0x4d8
> [  338.519681]  do_translation_fault+0xa4/0xc0
> [  338.519682]  do_mem_abort+0x4c/0xa8
> [  338.519685]  el0_da+0x2c/0x78
> [  338.519687]  el0t_64_sync_handler+0xb8/0x130
> [  338.519689]  el0t_64_sync+0x190/0x198
> [  338.519692] Code: aa0003e1 b9400862 11000442 b9000862 (f9400000) 
> [  338.519693] ---[ end trace 0000000000000000 ]---
> 
> The fault is when trying to do an atomic_long_read(&sem->count) here:
> 
> struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
> 					  struct rmap_walk_control *rwc)
> {
> 	struct anon_vma *anon_vma = NULL;
> 	struct anon_vma *root_anon_vma;
> 	unsigned long anon_mapping;
> 
> retry:
> 	rcu_read_lock();
> 	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> 		goto out;
> 	if (!folio_mapped(folio))
> 		goto out;
> 
> 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> 	root_anon_vma = READ_ONCE(anon_vma->root);
> 	if (down_read_trylock(&root_anon_vma->rwsem)) { <<<<<<<
> 
> I guess we are still corrupting folios?

I guess so ...

The thought occurs that we don't need to take the folios off the list.
I don't know whether that will fix anything, but this will fix your "running
out of memory" problem -- I forgot to drop the reference if folio_trylock()
failed.  Of course, I can't call folio_put() inside the lock, so we may
as well move the trylock back to the second loop.

Again, compile-tested only.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..4a2ab17f802d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/* Take pin on all head pages to avoid freeing them under us */
+	/* Take ref on all folios to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+		if (!folio_try_get(folio))
+			continue;
+		if (folio_batch_add(&batch, folio) == 0) {
+			--sc->nr_to_scan;
+			break;
 		}
 		if (!--sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+			continue;
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	folios_put(&batch);
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-07 17:33                   ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox
@ 2024-03-07 18:35                     ` Ryan Roberts
  2024-03-07 20:42                       ` Matthew Wilcox
  2024-03-08 11:44                     ` Ryan Roberts
  1 sibling, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-07 18:35 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 17:33, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 08:56:27AM +0000, Ryan Roberts wrote:
>> On 06/03/2024 21:55, Matthew Wilcox wrote:
>>> On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
>>>> Hang on, I think I see it.  It is a race between folio freeing and
>>>> deferred_split_scan(), but page migration is absolved.  Look:
>>>>
>>>> CPU 1: deferred_split_scan:
>>>> spin_lock_irqsave(split_queue_lock)
>>>> list_for_each_entry_safe()
>>>> folio_try_get()
>>>> list_move(&folio->_deferred_list, &list);
>>>> spin_unlock_irqrestore(split_queue_lock)
>>>> list_for_each_entry_safe() {
>>>> 	folio_trylock() <- fails
>>>> 	folio_put(folio);
>>>>
>>>> CPU 2: folio_put:
>>>> folio_undo_large_rmappable
>>>>         ds_queue = get_deferred_split_queue(folio);
>>>>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>>                 list_del_init(&folio->_deferred_list);
>>>> *** at this point CPU 1 is not holding the split_queue_lock; the
>>>> folio is on the local list.  Which we just corrupted ***
>>
>> Wow, this would have taken me weeks...
> 
> It certainly took me hours of staring at the code ...
> 
>> I just want to make sure I've understood correctly: CPU1's folio_put()
>> is not the last reference, and it keeps iterating through the local
>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>> to modify the local list concurrently with CPU1's iteration, so CPU1
>> probably goes into the weeds?
> 
> That is my suggestion for what the problem is, yes.
> 
>>> I looked at a few options, but I think we need to keep the refcount
>>> elevated until we've got the folios back on the deferred split list.
>>> And we can't call folio_put() while holding the split_queue_lock or
>>> we'll deadlock.  So we need to maintain a list of folios that isn't
>>> linked through deferred_list.  Anyway, this is basically untested,
>>> except that it compiles.
>>
>> If we can't call folio_put() under spinlock, then I agree.
>>
>>>
>>> Opinions?  Better patches?
>>
>> I assume the fact that one scan is limited to freeing a batch-worth of folios is not a problem? The shrinker will keep calling while there are folios on the deferred list?
> 
> I don't think it's a problem.  There's no particular requirement as to
> how much work a shrinker does, just that it tries to make some progress
> (afaik).
> 
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index fd745bcc97ff..0120a47ea7a1 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>>  	unsigned long flags;
>>> -	LIST_HEAD(list);
>>> +	struct folio_batch batch;
>>>  	struct folio *folio, *next;
>>>  	int split = 0;
>>>  
>>> @@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  		ds_queue = &sc->memcg->deferred_split_queue;
>>>  #endif
>>>  
>>> +	folio_batch_init(&batch);
>>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>  	/* Take pin on all head pages to avoid freeing them under us */
>>>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>>  							_deferred_list) {
>>> -		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> -			/* We lost race with folio_put() */
>>> -			list_del_init(&folio->_deferred_list);
>>> -			ds_queue->split_queue_len--;
>>> +		if (!folio_try_get(folio))
>>> +			continue;
>>> +		if (!folio_trylock(folio))
>>> +			continue;
>>> +		list_del_init(&folio->_deferred_list);
>>> +		if (folio_batch_add(&batch, folio) == 0) {
>>> +			--sc->nr_to_scan;
>>> +			break;
>>>  		}
>>>  		if (!--sc->nr_to_scan)
>>>  			break;
>>>  	}
>>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>  
>>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>>> -		if (!folio_trylock(folio))
>>> -			goto next;
>>> -		/* split_huge_page() removes page from list on success */
>>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>>  		if (!split_folio(folio))
>>>  			split++;
>>>  		folio_unlock(folio);
>>> -next:
>>> -		folio_put(folio);
>>>  	}
>>>  
>>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> -	list_splice_tail(&list, &ds_queue->split_queue);
>>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>> +		if (!folio_test_large(folio))
>>> +			continue;
>>> +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>>> +	}
>>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>  
>>> +	folios_put(&batch);
>>> +
>>>  	/*
>>>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>>>  	 * This can happen if pages were freed under us.
>>
>> I've added this patch to my branch and tested (still without the patch that I fingered as the culprit originally, for now). Unfortunately it is still blowing up at about the same rate, although it looks very different now. I've seen bad things twice. The first time was RCU stalls, but systemd had turned the log level down so no stack trace and I didn't manage to get any further information. The second time, this:
>>
>> [  338.519401] Unable to handle kernel paging request at virtual address fffc001b13a8c870
>> [  338.519402] Unable to handle kernel paging request at virtual address fffc001b13a8c870
>> [  338.519407] Mem abort info:
>> [  338.519407]   ESR = 0x0000000096000004
>> [  338.519408]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [  338.519588] Unable to handle kernel paging request at virtual address fffc001b13a8c870
>> [  338.519591] Mem abort info:
>> [  338.519592]   ESR = 0x0000000096000004
>> [  338.519593]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [  338.519594]   SET = 0, FnV = 0
>> [  338.519595]   EA = 0, S1PTW = 0
>> [  338.519596]   FSC = 0x04: level 0 translation fault
>> [  338.519597] Data abort info:
>> [  338.519597]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>> [  338.519598]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>> [  338.519599]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> [  338.519600] [fffc001b13a8c870] address between user and kernel address ranges
>> [  338.519602] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>> [  338.519605] Modules linked in:
>> [  338.519607] CPU: 43 PID: 3234 Comm: usemem Not tainted 6.8.0-rc5-00465-g279cb41b481e-dirty #3
>> [  338.519610] Hardware name: linux,dummy-virt (DT)
>> [  338.519611] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [  338.519613] pc : down_read_trylock+0x2c/0xd0
>> [  338.519618] lr : folio_lock_anon_vma_read+0x74/0x2c8
>> [  338.519623] sp : ffff800087f935c0
>> [  338.519623] x29: ffff800087f935c0 x28: 0000000000000000 x27: ffff800087f937e0
>> [  338.519626] x26: 0000000000000001 x25: ffff800087f937a8 x24: fffffc0007258180
>> [  338.519628] x23: ffff800087f936c8 x22: fffc001b13a8c870 x21: ffff0000f7d51d69
>> [  338.519630] x20: ffff0000f7d51d68 x19: fffffc0007258180 x18: 0000000000000000
>> [  338.519632] x17: 0000000000000001 x16: ffff0000c90ab458 x15: 0000000000000040
>> [  338.519634] x14: ffff0000c8c7b558 x13: 0000000000000228 x12: 000040f22f534640
>> [  338.519637] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080338b3c
>> [  338.519639] x8 : ffff800087f93618 x7 : 0000000000000000 x6 : ffff0000c9692f50
>> [  338.519641] x5 : ffff800087f936b0 x4 : 0000000000000001 x3 : ffff0000d70d9140
>> [  338.519643] x2 : 0000000000000001 x1 : fffc001b13a8c870 x0 : fffc001b13a8c870
>> [  338.519645] Call trace:
>> [  338.519646]  down_read_trylock+0x2c/0xd0
>> [  338.519648]  folio_lock_anon_vma_read+0x74/0x2c8
>> [  338.519650]  rmap_walk_anon+0x1d8/0x2c0
>> [  338.519652]  folio_referenced+0x1b4/0x1e0
>> [  338.519655]  shrink_folio_list+0x768/0x10c8
>> [  338.519658]  shrink_lruvec+0x5dc/0xb30
>> [  338.519660]  shrink_node+0x4d8/0x8b0
>> [  338.519662]  do_try_to_free_pages+0xe0/0x5a8
>> [  338.519665]  try_to_free_mem_cgroup_pages+0x128/0x2d0
>> [  338.519667]  try_charge_memcg+0x114/0x658
>> [  338.519671]  __mem_cgroup_charge+0x6c/0xd0
>> [  338.519672]  __handle_mm_fault+0x42c/0x1640
>> [  338.519675]  handle_mm_fault+0x70/0x290
>> [  338.519677]  do_page_fault+0xfc/0x4d8
>> [  338.519681]  do_translation_fault+0xa4/0xc0
>> [  338.519682]  do_mem_abort+0x4c/0xa8
>> [  338.519685]  el0_da+0x2c/0x78
>> [  338.519687]  el0t_64_sync_handler+0xb8/0x130
>> [  338.519689]  el0t_64_sync+0x190/0x198
>> [  338.519692] Code: aa0003e1 b9400862 11000442 b9000862 (f9400000) 
>> [  338.519693] ---[ end trace 0000000000000000 ]---
>>
>> The fault is when trying to do an atomic_long_read(&sem->count) here:
>>
>> struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
>> 					  struct rmap_walk_control *rwc)
>> {
>> 	struct anon_vma *anon_vma = NULL;
>> 	struct anon_vma *root_anon_vma;
>> 	unsigned long anon_mapping;
>>
>> retry:
>> 	rcu_read_lock();
>> 	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>> 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>> 		goto out;
>> 	if (!folio_mapped(folio))
>> 		goto out;
>>
>> 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>> 	root_anon_vma = READ_ONCE(anon_vma->root);
>> 	if (down_read_trylock(&root_anon_vma->rwsem)) { <<<<<<<
>>
>> I guess we are still corrupting folios?
> 
> I guess so ...

I noticed commit dfa3df509576 ("mm: fix list corruption in put_pages_list")
turned up in mm-unstable today (after I sent the above). Although I haven't done
much of the exact testing that was previously causing oopses, I also haven't
seen any since I rebased onto today's mm-unstable. Could that fix be helping us?

> 
> The thought occurs that we don't need to take the folios off the list.
> I don't know that will fix anything, but this will fix your "running out
> of memory" problem -- I forgot to drop the reference if folio_trylock()
> failed.  

Ugh, how did I not spot that! So I guess that fits the hypothesis that the
original change is just increasing the race window and therefore we are leaking
more folios due to the failed trylock.

I'll give this a spin in the morning and report back.

> Of course, I can't call folio_put() inside the lock, so may
> as well move the trylock back to the second loop.
> 
> Again, compile-tested only.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd745bcc97ff..4a2ab17f802d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct folio_batch batch;
>  	struct folio *folio, *next;
>  	int split = 0;
>  
> @@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>  
> +	folio_batch_init(&batch);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	/* Take pin on all head pages to avoid freeing them under us */
> +	/* Take ref on all folios to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
> -		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> -			/* We lost race with folio_put() */
> -			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +		if (!folio_try_get(folio))
> +			continue;
> +		if (folio_batch_add(&batch, folio) == 0) {
> +			--sc->nr_to_scan;
> +			break;
>  		}
>  		if (!--sc->nr_to_scan)
>  			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>  		if (!folio_trylock(folio))
> -			goto next;
> -		/* split_huge_page() removes page from list on success */
> +			continue;
>  		if (!split_folio(folio))
>  			split++;
>  		folio_unlock(folio);
> -next:
> -		folio_put(folio);
>  	}
>  
> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +	folios_put(&batch);
>  
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-07 18:35                     ` Ryan Roberts
@ 2024-03-07 20:42                       ` Matthew Wilcox
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-07 20:42 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 06:35:16PM +0000, Ryan Roberts wrote:
> I noticed commit dfa3df509576 ("mm: fix list corruption in put_pages_list")
> turned up in mm-unstable today (after I sent the above). Although I haven't done
> much of the exact testing that was previously causing oopses, I also haven't
> seen any since I rebased onto today's mm-unstable. Could that fix be helping us?

I wish.  Wrong list (lru vs deferred), and the symptom of that crash was
an immediate crash, not a deferred one.  Although maybe with the
right/wrong debugging options ...

> > The thought occurs that we don't need to take the folios off the list.
> > I don't know that will fix anything, but this will fix your "running out
> > of memory" problem -- I forgot to drop the reference if folio_trylock()
> > failed.  
> 
> Ugh, how did I not spot that! So I guess that fits the hypothesis that the
> original change is just increasing the race window and therefore we are leaking
> more folios due to the failed trylock.

It doesn't _confirm_ it, but it certainly fits the theory!
Thanks for testing.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re:
  2024-03-07 16:24                         ` Re: Ryan Roberts
@ 2024-03-07 23:02                           ` Matthew Wilcox
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-07 23:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin, Fengwei, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 04:24:43PM +0000, Ryan Roberts wrote:
> > But if I run only with the deferred split fix and DO NOT revert the other
> > change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> > stalls where I can't even interact on the serial port. Sometimes (more usually)
> > everything just gets stuck trying to reclaim and allocate memory. And when I
> > kill the jobs, I still have barely any memory in the system - about 10% what I
> > would expect.

(for the benefit of anyone trying to follow along, this is now
understood; it was my missing folio_put() in the 'folio_trylock failed'
path)

> I notice that before the commit, large folios are uncharged with
> __mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios().
> 
> The former has an upfront check:
> 
> 	if (!folio_memcg(folio))
> 		return;
> 
> I'm not exactly sure what that's checking but could the fact this is missing
> after the change cause things to go wonky?

Honestly, I think that's stale.  uncharge_folio() checks the same
thing very early on, so all it's actually saving is a test of the LRU
flag.

Looks like the need for it went away in 2017 with commit a9d5adeeb4b2c73c
which stopped using page->lru to gather the single page onto a
degenerate list.  I'll try to remember to submit a patch to delete
that check.

By the way, something we could try to see if the problem goes away is to
re-narrow the window that I widened, i.e. something like this:

+++ b/mm/swap.c
@@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                        free_huge_folio(folio);
                        continue;
                }
+               if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+                       folio_undo_large_rmappable(folio);

                __page_cache_release(folio, &lruvec, &flags);




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re:
  2024-03-07 14:05                     ` Re: Matthew Wilcox
  2024-03-07 15:24                       ` Re: Ryan Roberts
@ 2024-03-08  1:06                       ` Yin, Fengwei
  1 sibling, 0 replies; 73+ messages in thread
From: Yin, Fengwei @ 2024-03-08  1:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying



On 3/7/2024 10:05 PM, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/7/2024 4:56 PM,  wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is this can not corrupt the folio->deferred_list as
>> this folio was iterated already.
> 
> I am not convinced about that at all.  It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the proof is very much on you that this is not a bug!
Removing a folio from the deferred list in folio_put() also requires the
split_queue_lock. So my understanding is that there is no deleting without
holding the correct lock; it's the local list iteration that is impacted.
But that's not the issue Ryan hit here.
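
To spell that out with a sketch (simplified from the trace Matthew quoted
earlier, not the exact source), the final folio_put() path does roughly:

	ds_queue = get_deferred_split_queue(folio);
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		/*
		 * This unlinks the folio from whichever list it is on.
		 * If deferred_split_scan() has already moved it to its
		 * local list, that local list is rewritten here while
		 * CPU1 holds no lock protecting its iteration over it.
		 */
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

So the queue lock is held, but it doesn't cover the lockless walk of the
local list.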

> 
>> But I did see other strange thing:
>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>> index:0xffffbd0a0 pfn:0x2554a0
>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>
>> In free_unref_folios(), there is code like following:
>>                  if (order > 0 && folio_test_large_rmappable(folio))
>>                          folio_undo_large_rmappable(folio);
>>
>> But with destroy_large_folio():
>>          if (folio_test_large_rmappable(folio))
>>
>> 			folio_undo_large_rmappable(folio);
>>
>> Can it connect to the folio has zero refcount still in deferred list
>> with Matthew's patch?
>>
>>
>> Looks like folio order was cleared unexpected somewhere.
> 
> No, we intentionally clear it:
> 
> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
> page[1].flags &= ~PAGE_FLAGS_SECOND;
> 
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().
Oh. That's the key. Thanks a lot for the detailed explanation.

I thought there was a bug in another place, covered by
destroy_large_folio() but exposed by free_unref_folios()...


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-07 17:33                   ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox
  2024-03-07 18:35                     ` Ryan Roberts
@ 2024-03-08 11:44                     ` Ryan Roberts
  2024-03-08 12:09                       ` Ryan Roberts
  2024-03-09  6:09                       ` Matthew Wilcox
  1 sibling, 2 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-08 11:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

> The thought occurs that we don't need to take the folios off the list.
> I don't know that will fix anything, but this will fix your "running out
> of memory" problem -- I forgot to drop the reference if folio_trylock()
> failed.  Of course, I can't call folio_put() inside the lock, so may
> as well move the trylock back to the second loop.
> 
> Again, compile-tested only.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd745bcc97ff..4a2ab17f802d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct folio_batch batch;
>  	struct folio *folio, *next;
>  	int split = 0;
>  
> @@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>  
> +	folio_batch_init(&batch);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	/* Take pin on all head pages to avoid freeing them under us */
> +	/* Take ref on all folios to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
> -		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> -			/* We lost race with folio_put() */
> -			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +		if (!folio_try_get(folio))
> +			continue;
> +		if (folio_batch_add(&batch, folio) == 0) {
> +			--sc->nr_to_scan;
> +			break;
>  		}
>  		if (!--sc->nr_to_scan)
>  			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>  		if (!folio_trylock(folio))
> -			goto next;
> -		/* split_huge_page() removes page from list on success */
> +			continue;
>  		if (!split_folio(folio))
>  			split++;
>  		folio_unlock(folio);
> -next:
> -		folio_put(folio);
>  	}
>  
> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +	folios_put(&batch);
>  
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.


OK, I've tested this; the good news is that I haven't seen any oopses or memory
leaks. The bad news is that it still takes an absolute age (hours) to complete
the same test that, without "mm: Allow non-hugetlb large folios to be batch
processed", took a couple of minutes. And during that time, the system is
completely unresponsive - the serial terminal doesn't work and I can't even
break in with sysrq. And sometimes I see RCU stall warnings.

Dumping all the CPU back traces with gdb, all the cores (except one) are
contending on the deferred split lock.

A couple of thoughts:

 - Since we are now taking a maximum of 15 folios into a batch,
deferred_split_scan() is called much more often (in a tight loop from
do_shrink_slab()). Could it be that we are just trying to take the lock so much
more often now? I don't think it's quite that simple, because we take the lock
for every single folio when adding it to the queue, so the dequeuing cost should
still be a factor of 15 fewer lock acquisitions.

- do_shrink_slab() might be calling deferred_split_scan() in a tight loop with
deferred_split_scan() returning 0 most of the time. If there are still folios on
the deferred split list but deferred_split_scan() was unable to lock any of
them, then it will return 0, not SHRINK_STOP, so do_shrink_slab() will keep
calling it, essentially live-locking (see the sketch after this list). Has your
patch changed how long the folio stays locked? I don't think so...

 - Ahh, perhaps it's as simple as your fix having removed the code that took the
folio off the deferred split queue if it failed to get a reference? That could
mean we end up returning 0 instead of SHRINK_STOP too. I'll have a play.
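
For reference, the tail of deferred_split_scan() that picks between
returning 0 and SHRINK_STOP is roughly this (from memory, so treat it as a
sketch rather than the exact source):

	/*
	 * Stop shrinker if we didn't split any page, but the queue is empty.
	 * This can happen if pages were freed under us.
	 */
	if (!split && list_empty(&ds_queue->split_queue))
		return SHRINK_STOP;
	return split;

So as far as I can tell, if nothing gets split while folios remain queued
(or pinned in the batch), we return 0 rather than SHRINK_STOP and
do_shrink_slab() just keeps calling back in.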



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 11:44                     ` Ryan Roberts
@ 2024-03-08 12:09                       ` Ryan Roberts
  2024-03-08 14:21                         ` Ryan Roberts
  2024-03-08 15:33                         ` Matthew Wilcox
  2024-03-09  6:09                       ` Matthew Wilcox
  1 sibling, 2 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-08 12:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 08/03/2024 11:44, Ryan Roberts wrote:
>> The thought occurs that we don't need to take the folios off the list.
>> I don't know that will fix anything, but this will fix your "running out
>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>> failed.  Of course, I can't call folio_put() inside the lock, so may
>> as well move the trylock back to the second loop.
>>
>> Again, compile-tested only.
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fd745bcc97ff..4a2ab17f802d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>  	unsigned long flags;
>> -	LIST_HEAD(list);
>> +	struct folio_batch batch;
>>  	struct folio *folio, *next;
>>  	int split = 0;
>>  
>> @@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  		ds_queue = &sc->memcg->deferred_split_queue;
>>  #endif
>>  
>> +	folio_batch_init(&batch);
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	/* Take pin on all head pages to avoid freeing them under us */
>> +	/* Take ref on all folios to avoid freeing them under us */
>>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>  							_deferred_list) {
>> -		if (folio_try_get(folio)) {
>> -			list_move(&folio->_deferred_list, &list);
>> -		} else {
>> -			/* We lost race with folio_put() */
>> -			list_del_init(&folio->_deferred_list);
>> -			ds_queue->split_queue_len--;
>> +		if (!folio_try_get(folio))
>> +			continue;
>> +		if (folio_batch_add(&batch, folio) == 0) {
>> +			--sc->nr_to_scan;
>> +			break;
>>  		}
>>  		if (!--sc->nr_to_scan)
>>  			break;
>>  	}
>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  
>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>  		if (!folio_trylock(folio))
>> -			goto next;
>> -		/* split_huge_page() removes page from list on success */
>> +			continue;
>>  		if (!split_folio(folio))
>>  			split++;
>>  		folio_unlock(folio);
>> -next:
>> -		folio_put(folio);
>>  	}
>>  
>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	list_splice_tail(&list, &ds_queue->split_queue);
>> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> +	folios_put(&batch);
>>  
>>  	/*
>>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
> 
> 
> OK I've tested this; the good news is that I haven't seen any oopses or memory
> leaks. The bad news is that it still takes an absolute age (hours) to complete
> the same test that without "mm: Allow non-hugetlb large folios to be batch
> processed" took a couple of mins. And during that time, the system is completely
> unresponsive - serial terminal doesn't work - can't even break in with sysreq.
> And sometimes I see RCU stall warnings.
> 
> Dumping all the CPU back traces with gdb, all the cores (except one) are
> contending on the the deferred split lock.
> 
> A couple of thoughts:
> 
>  - Since we are now taking a maximum of 15 folios into a batch,
> deferred_split_scan() is called much more often (in a tight loop from
> do_shrink_slab()). Could it be that we are just trying to take the lock so much
> more often now? I don't think it's quite that simple because we take the lock
> for every single folio when adding it to the queue, so the dequeing cost should
> still be a factor of 15 locks less.
> 
> - do_shrink_slab() might be calling deferred_split_scan() in a tight loop with
> deferred_split_scan() returning 0 most of the time. If there are still folios on
> the deferred split list but deferred_split_scan() was unable to lock any folios
> then it will return 0, not SHRINK_STOP, so do_shrink_slab() will keep calling
> it, essentially live locking. Has your patch changed the duration of the folio
> being locked? I don't think so...
> 
> - Ahh, perhaps its as simple as your fix has removed the code that removed the
> folio from the deferred split queue if it fails to get a reference? That could
> mean we end up returning 0 instead of SHRINK_STOP too. I'll have play.
> 

I tested the last idea by adding this back in:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d46897d7ea7f..50b07362923a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3327,8 +3327,12 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
        /* Take ref on all folios to avoid freeing them under us */
        list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
                                                        _deferred_list) {
-               if (!folio_try_get(folio))
+               if (!folio_try_get(folio)) {
+                       /* We lost race with folio_put() */
+                       list_del_init(&folio->_deferred_list);
+                       ds_queue->split_queue_len--;
                        continue;
+               }
                if (folio_batch_add(&batch, folio) == 0) {
                        --sc->nr_to_scan;
                        break;

The test now gets further than the point where it previously live-locked, but
I then get a new oops (this is just yesterday's mm-unstable with your fix v2 and
the above change):

[  247.788985] BUG: Bad page state in process usemem  pfn:ae58c2
[  247.789617] page: refcount:0 mapcount:0 mapping:00000000dc16b680 index:0x1 pfn:0xae58c2
[  247.790129] aops:0x0 ino:dead000000000122
[  247.790394] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[  247.790821] page_type: 0xffffffff()
[  247.791052] raw: 0bfffc0000000000 0000000000000000 fffffc002a963090 fffffc002a963090
[  247.791546] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[  247.792258] page dumped because: non-NULL mapping
[  247.792567] Modules linked in:
[  247.792772] CPU: 0 PID: 2052 Comm: usemem Not tainted 6.8.0-rc5-00456-g52fd6cd3bee5 #30
[  247.793300] Hardware name: linux,dummy-virt (DT)
[  247.793680] Call trace:
[  247.793894]  dump_backtrace+0x9c/0x100
[  247.794200]  show_stack+0x20/0x38
[  247.794460]  dump_stack_lvl+0x90/0xb0
[  247.794726]  dump_stack+0x18/0x28
[  247.794964]  bad_page+0x88/0x128
[  247.795196]  get_page_from_freelist+0xdc4/0x1280
[  247.795520]  __alloc_pages+0xe8/0x1038
[  247.795781]  alloc_pages_mpol+0x90/0x278
[  247.796059]  vma_alloc_folio+0x70/0xd0
[  247.796320]  __handle_mm_fault+0xc40/0x19a0
[  247.796610]  handle_mm_fault+0x7c/0x418
[  247.796908]  do_page_fault+0x100/0x690
[  247.797231]  do_translation_fault+0xb4/0xd0
[  247.797584]  do_mem_abort+0x4c/0xa8
[  247.797874]  el0_da+0x54/0xb8
[  247.798123]  el0t_64_sync_handler+0xe4/0x158
[  247.798473]  el0t_64_sync+0x190/0x198
[  247.815597] Disabling lock debugging due to kernel taint

And then it goes into RCU stalls after that. I saw a similar non-NULL mapping
oops yesterday. But with the deferred split fix in place, I can now see this
reliably.

My sense is that the first deferred split issue is now fully resolved once the
extra code above is reinserted, but we still have a second problem. Thoughts?

Perhaps I can bisect this given it seems pretty reproducible.

Thanks,
Ryan



^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 12:09                       ` Ryan Roberts
@ 2024-03-08 14:21                         ` Ryan Roberts
  2024-03-08 15:11                           ` Matthew Wilcox
  2024-03-08 15:33                         ` Matthew Wilcox
  1 sibling, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-08 14:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 08/03/2024 12:09, Ryan Roberts wrote:
> On 08/03/2024 11:44, Ryan Roberts wrote:
>>> The thought occurs that we don't need to take the folios off the list.
>>> I don't know that will fix anything, but this will fix your "running out
>>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>>> failed.  Of course, I can't call folio_put() inside the lock, so may
>>> as well move the trylock back to the second loop.
>>>
>>> Again, compile-tessted only.
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index fd745bcc97ff..4a2ab17f802d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>>  	unsigned long flags;
>>> -	LIST_HEAD(list);
>>> +	struct folio_batch batch;
>>>  	struct folio *folio, *next;
>>>  	int split = 0;
>>>  
>>> @@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  		ds_queue = &sc->memcg->deferred_split_queue;
>>>  #endif
>>>  
>>> +	folio_batch_init(&batch);
>>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> -	/* Take pin on all head pages to avoid freeing them under us */
>>> +	/* Take ref on all folios to avoid freeing them under us */
>>>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>>  							_deferred_list) {
>>> -		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> -			/* We lost race with folio_put() */
>>> -			list_del_init(&folio->_deferred_list);
>>> -			ds_queue->split_queue_len--;
>>> +		if (!folio_try_get(folio))
>>> +			continue;
>>> +		if (folio_batch_add(&batch, folio) == 0) {
>>> +			--sc->nr_to_scan;
>>> +			break;
>>>  		}
>>>  		if (!--sc->nr_to_scan)
>>>  			break;
>>>  	}
>>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>  
>>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>>  		if (!folio_trylock(folio))
>>> -			goto next;
>>> -		/* split_huge_page() removes page from list on success */
>>> +			continue;
>>>  		if (!split_folio(folio))
>>>  			split++;
>>>  		folio_unlock(folio);
>>> -next:
>>> -		folio_put(folio);
>>>  	}
>>>  
>>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> -	list_splice_tail(&list, &ds_queue->split_queue);
>>> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>> +	folios_put(&batch);
>>>  
>>>  	/*
>>>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>>
>>
>> OK I've tested this; the good news is that I haven't seen any oopses or memory
>> leaks. The bad news is that it still takes an absolute age (hours) to complete
>> the same test that without "mm: Allow non-hugetlb large folios to be batch
>> processed" took a couple of mins. And during that time, the system is completely
>> unresponsive - serial terminal doesn't work - can't even break in with sysreq.
>> And sometimes I see RCU stall warnings.
>>
>> Dumping all the CPU back traces with gdb, all the cores (except one) are
>> contending on the deferred split lock.
>>
>> A couple of thoughts:
>>
>>  - Since we are now taking a maximum of 15 folios into a batch,
>> deferred_split_scan() is called much more often (in a tight loop from
>> do_shrink_slab()). Could it be that we are just trying to take the lock so much
>> more often now? I don't think it's quite that simple because we take the lock
>> for every single folio when adding it to the queue, so the dequeing cost should
>> still be a factor of 15 locks less.
>>
>> - do_shrink_slab() might be calling deferred_split_scan() in a tight loop with
>> deferred_split_scan() returning 0 most of the time. If there are still folios on
>> the deferred split list but deferred_split_scan() was unable to lock any folios
>> then it will return 0, not SHRINK_STOP, so do_shrink_slab() will keep calling
>> it, essentially live locking. Has your patch changed the duration of the folio
>> being locked? I don't think so...
>>
>>  - Ahh, perhaps it's as simple as your fix has removed the code that removed the
>> folio from the deferred split queue if it fails to get a reference? That could
>> mean we end up returning 0 instead of SHRINK_STOP too. I'll have a play.
>>
> 
> I tested the last idea by adding this back in:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d46897d7ea7f..50b07362923a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3327,8 +3327,12 @@ static unsigned long deferred_split_scan(struct shrinker
> *shrink,
>         /* Take ref on all folios to avoid freeing them under us */
>         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>                                                         _deferred_list) {
> -               if (!folio_try_get(folio))
> +               if (!folio_try_get(folio)) {
> +                       /* We lost race with folio_put() */
> +                       list_del_init(&folio->_deferred_list);
> +                       ds_queue->split_queue_len--;
>                         continue;
> +               }
>                 if (folio_batch_add(&batch, folio) == 0) {
>                         --sc->nr_to_scan;
>                         break;
> 
> The test now gets further than where it was previously getting live-locked, but
> I then get a new oops (this is just yesterday's mm-unstable with your fix v2 and
> the above change):
> 
> [  247.788985] BUG: Bad page state in process usemem  pfn:ae58c2
> [  247.789617] page: refcount:0 mapcount:0 mapping:00000000dc16b680 index:0x1
> pfn:0xae58c2
> [  247.790129] aops:0x0 ino:dead000000000122
> [  247.790394] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  247.790821] page_type: 0xffffffff()
> [  247.791052] raw: 0bfffc0000000000 0000000000000000 fffffc002a963090
> fffffc002a963090
> [  247.791546] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  247.792258] page dumped because: non-NULL mapping
> [  247.792567] Modules linked in:
> [  247.792772] CPU: 0 PID: 2052 Comm: usemem Not tainted
> 6.8.0-rc5-00456-g52fd6cd3bee5 #30
> [  247.793300] Hardware name: linux,dummy-virt (DT)
> [  247.793680] Call trace:
> [  247.793894]  dump_backtrace+0x9c/0x100
> [  247.794200]  show_stack+0x20/0x38
> [  247.794460]  dump_stack_lvl+0x90/0xb0
> [  247.794726]  dump_stack+0x18/0x28
> [  247.794964]  bad_page+0x88/0x128
> [  247.795196]  get_page_from_freelist+0xdc4/0x1280
> [  247.795520]  __alloc_pages+0xe8/0x1038
> [  247.795781]  alloc_pages_mpol+0x90/0x278
> [  247.796059]  vma_alloc_folio+0x70/0xd0
> [  247.796320]  __handle_mm_fault+0xc40/0x19a0
> [  247.796610]  handle_mm_fault+0x7c/0x418
> [  247.796908]  do_page_fault+0x100/0x690
> [  247.797231]  do_translation_fault+0xb4/0xd0
> [  247.797584]  do_mem_abort+0x4c/0xa8
> [  247.797874]  el0_da+0x54/0xb8
> [  247.798123]  el0t_64_sync_handler+0xe4/0x158
> [  247.798473]  el0t_64_sync+0x190/0x198
> [  247.815597] Disabling lock debugging due to kernel taint
> 
> And then into RCU stalls after that. I have seen a similar non-NULL mapping oops
> yesterday. But with the deferred split fix in place, I can now see this reliably.
> 
> My sense is that the first deferred split issue is now fully resolved once the
> extra code above is reinserted, but we still have a second problem. Thoughts?
> 
> Perhaps I can bisect this given it seems pretty reproducible.

OK few more bits of information:

bisect lands back on the same patch it always does; "mm: Allow non-hugetlb large
folios to be batch processed". Without this change, I can't reproduce the above
oops.

With that change present, if I "re-narrow" the window as you suggested, I also
can't reproduce the problem.

As far as I can tell, mapping is zeroed when the page is freed, and the same
page checks are run at that point too. So mapping must be written to while
the page is in the buddy? Perhaps something thinks it's still a tail page
during split, but the buddy thinks it's been freed?
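
For reference, the check that fires is (roughly, paraphrasing the shared
sanity check in mm/page_alloc.c from memory rather than quoting it):

	/* paraphrased sketch of the check behind "non-NULL mapping" */
	if (unlikely(page->mapping != NULL))
		bad_reason = "non-NULL mapping";

and the same check runs on both the free and the allocate side, so a
->mapping that only becomes non-NULL while the page sits in the buddy
isn't reported until the allocation path checks it again, which matches
the get_page_from_freelist stack above.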

Also the mapping value 00000000dc16b680 is not a valid kernel address, I don't
think. So surprised that get_kernel_nofault(host, &mapping->host) works.


> 
> Thanks,
> Ryan
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 14:21                         ` Ryan Roberts
@ 2024-03-08 15:11                           ` Matthew Wilcox
  2024-03-08 16:03                             ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-08 15:11 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Fri, Mar 08, 2024 at 02:21:30PM +0000, Ryan Roberts wrote:
> > [  247.788985] BUG: Bad page state in process usemem  pfn:ae58c2
> > [  247.789617] page: refcount:0 mapcount:0 mapping:00000000dc16b680 index:0x1
> > pfn:0xae58c2
> > [  247.790129] aops:0x0 ino:dead000000000122
> > [  247.790394] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> > [  247.790821] page_type: 0xffffffff()
> > [  247.791052] raw: 0bfffc0000000000 0000000000000000 fffffc002a963090
> > fffffc002a963090
> > [  247.791546] raw: 0000000000000001 0000000000000000 00000000ffffffff
> > 0000000000000000
> > [  247.792258] page dumped because: non-NULL mapping
> > [  247.792567] Modules linked in:
> > [  247.792772] CPU: 0 PID: 2052 Comm: usemem Not tainted
> > 6.8.0-rc5-00456-g52fd6cd3bee5 #30
> > [  247.793300] Hardware name: linux,dummy-virt (DT)
> > [  247.793680] Call trace:
> > [  247.793894]  dump_backtrace+0x9c/0x100
> > [  247.794200]  show_stack+0x20/0x38
> > [  247.794460]  dump_stack_lvl+0x90/0xb0
> > [  247.794726]  dump_stack+0x18/0x28
> > [  247.794964]  bad_page+0x88/0x128
> > [  247.795196]  get_page_from_freelist+0xdc4/0x1280
> > [  247.795520]  __alloc_pages+0xe8/0x1038
...
> > My sense is that the first deferred split issue is now fully resolved once the
> > extra code above is reinserted, but we still have a second problem. Thoughts?

That seems likely ;-(  It doesn't fit the same pattern as the ones we've
been looking at.

> bisect lands back on the same patch it always does; "mm: Allow non-hugetlb large
> folios to be batch processed". Without this change, I can't reproduce the above
> oops.
> 
> With that change present, if I "re-narrow" the window as you suggested, I also
> can't reproduce the problem.

Ah, a pre-existing condition ;-(

> As far as I can tell, mapping is zeroed when the page is freed, and the same
> page checks are run at that point too. So mapping must be written to while
> the page is in the buddy? Perhaps something thinks it's still a tail page
> during split, but the buddy thinks it's been freed?

I'll stare at those codepaths; see if I can see anything.

> Also the mapping value 00000000dc16b680 is not a valid kernel address, I don't
> think. So surprised that get_kernel_nofault(host, &mapping->host) works.

Ah, you've been caught by hashed kernel pointers.  You can tell because
the top 32 bits are 0.  The real pointer is fffffc002a963090 (see the
raw dump).
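
(That mapping: value in the dump banner is a plain %p, hence the hash; only
%px, or the raw: lines which print the words directly, show the real bits.
A one-liner to see the difference, purely for illustration:

	/* %p hashes the pointer, %px prints it raw */
	pr_info("hashed %p vs raw %px\n", folio, folio);

so never try to read a bare %p value as an address.)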

Actually, I have a clue!  The third and fourth word have the same value.
That's indicative of an empty list_head.  And if this were LRU, that would
be the second and third word.  And the PFN is congruent to 2 modulo 4.
So this is the second tail page, and that's an empty deferred_list.
So how do we init a list_head after a folio gets freed?
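
For anyone following along, the reason two identical words read as an empty
list_head (paraphrased sketch of the definitions in linux/list.h and
linux/types.h):

	struct list_head {
		struct list_head *next, *prev;
	};

	static inline void INIT_LIST_HEAD(struct list_head *list)
	{
		/* an empty list points back at itself */
		list->next = list;
		list->prev = list;
	}

An empty list_head is just two copies of its own address, and on the second
tail page ->mapping doubles as one of the _deferred_list pointers, which is
why the raw dump shows fffffc002a963090 twice and why ->mapping looks
non-NULL.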


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 12:09                       ` Ryan Roberts
  2024-03-08 14:21                         ` Ryan Roberts
@ 2024-03-08 15:33                         ` Matthew Wilcox
  1 sibling, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-08 15:33 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Fri, Mar 08, 2024 at 12:09:38PM +0000, Ryan Roberts wrote:
> On 08/03/2024 11:44, Ryan Roberts wrote:
> > Dumping all the CPU back traces with gdb, all the cores (except one) are
> > contending on the deferred split lock.
> > 
> > A couple of thoughts:
> > 
> >  - Since we are now taking a maximum of 15 folios into a batch,
> > deferred_split_scan() is called much more often (in a tight loop from
> > do_shrink_slab()). Could it be that we are just trying to take the lock so much
> > more often now? I don't think it's quite that simple because we take the lock
> > for every single folio when adding it to the queue, so the dequeing cost should
> > still be a factor of 15 locks less.
> > 
> > - do_shrink_slab() might be calling deferred_split_scan() in a tight loop with
> > deferred_split_scan() returning 0 most of the time. If there are still folios on
> > the deferred split list but deferred_split_scan() was unable to lock any folios
> > then it will return 0, not SHRINK_STOP, so do_shrink_slab() will keep calling
> > it, essentially live locking. Has your patch changed the duration of the folio
> > being locked? I don't think so...
> > 
> >  - Ahh, perhaps it's as simple as your fix has removed the code that removed the
> > folio from the deferred split queue if it fails to get a reference? That could
> > mean we end up returning 0 instead of SHRINK_STOP too. I'll have a play.
> > 
> 
> I tested the last idea by adding this back in:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d46897d7ea7f..50b07362923a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3327,8 +3327,12 @@ static unsigned long deferred_split_scan(struct shrinker
> *shrink,
>         /* Take ref on all folios to avoid freeing them under us */
>         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>                                                         _deferred_list) {
> -               if (!folio_try_get(folio))
> +               if (!folio_try_get(folio)) {
> +                       /* We lost race with folio_put() */
> +                       list_del_init(&folio->_deferred_list);
> +                       ds_queue->split_queue_len--;
>                         continue;
> +               }
>                 if (folio_batch_add(&batch, folio) == 0) {
>                         --sc->nr_to_scan;
>                         break;
> 
> The test now gets further than where it was previously getting live-locked, but

If the deferred_split_lock contention comes back, we can fix
split_huge_page_to_list() to only take the lock if the page is on the
list.  Right now, it takes it unconditionally which could be avoided.
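
Roughly this shape (just a sketch of what I mean, untested, not a patch):

	/* in split_huge_page_to_list(), instead of taking the lock blindly */
	if (!list_empty(&folio->_deferred_list)) {
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		/* re-check under the lock in case we raced with the shrinker */
		if (!list_empty(&folio->_deferred_list)) {
			ds_queue->split_queue_len--;
			list_del_init(&folio->_deferred_list);
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}
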
I'm not going to send a patch just yet to avoid confusion.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 15:11                           ` Matthew Wilcox
@ 2024-03-08 16:03                             ` Matthew Wilcox
  2024-03-08 17:13                               ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-08 16:03 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Fri, Mar 08, 2024 at 03:11:35PM +0000, Matthew Wilcox wrote:
> Actually, I have a clue!  The third and fourth word have the same value.
> That's indicative of an empty list_head.  And if this were LRU, that would
> be the second and third word.  And the PFN is congruent to 2 modulo 4.
> So this is the second tail page, and that's an empty deferred_list.
> So how do we init a list_head after a folio gets freed?

We should probably add this patch anyway, because why wouldn't we want
to check this.  Maybe it'll catch your offender?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 025ad1a7df7b..fc9c7ca24c4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
+		 * the second tail page: ->mapping is deferred_list.next
 		 */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "still on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {

(thinking about it, this may not be right for all tail pages; will Slab
stumble over this?  It doesn't seem to stumble on _entire_mapcount, but
then we always initialise _entire_mapcount for all compound pages
and we don't initialise _deferred_list for slab ... gah)


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 16:03                             ` Matthew Wilcox
@ 2024-03-08 17:13                               ` Ryan Roberts
  2024-03-08 18:09                                 ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-08 17:13 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

+ DavidH

On 08/03/2024 16:03, Matthew Wilcox wrote:
> On Fri, Mar 08, 2024 at 03:11:35PM +0000, Matthew Wilcox wrote:
>> Actually, I have a clue!  The third and fourth word have the same value.
>> That's indicative of an empty list_head.  And if this were LRU, that would
>> be the second and third word.  And the PFN is congruent to 2 modulo 4.
>> So this is the second tail page, and that's an empty deferred_list.
>> So how do we init a list_head after a folio gets freed?
> 
> We should probably add this patch anyway, because why wouldn't we want
> to check this.  Maybe it'll catch your offender?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 025ad1a7df7b..fc9c7ca24c4c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		break;
>  	case 2:
>  		/*
> -		 * the second tail page: ->mapping is
> -		 * deferred_list.next -- ignore value.
> +		 * the second tail page: ->mapping is deferred_list.next
>  		 */
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "still on deferred list");
> +			goto out;
> +		}
>  		break;
>  	default:
>  		if (page->mapping != TAIL_MAPPING) {
> 
> (thinking about it, this may not be right for all tail pages; will Slab
> stumble over this?  It doesn't seem to stumble on _entire_mapcount, but
> then we always initialise _entire_mapcount for all compound pages
> and we don't initialise _deferred_list for slab ... gah)

Yeah I'm getting a huge number of hits for this check. Most either have
kfree() or free_slab() or page_to_skb() (networking code?) in the stack.
Ideally I need to filter on anon pages only, but presumably we have already
ditched that info? Actually it looks like the head page hasn't been nuked yet
so I should be able to test the low bit of mapping... let me have a play.
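
Something like this, maybe (untested sketch; assumes the head page's
->mapping is still intact at this point, which it appears to be):

	case 2:
		/*
		 * Filter to anon folios; slab & friends never initialise
		 * _deferred_list, so the check is meaningless for them.
		 */
		if (unlikely((unsigned long)head_page->mapping & PAGE_MAPPING_ANON) &&
		    unlikely(!list_empty(&folio->_deferred_list))) {
			bad_page(page, "still on deferred list");
			goto out;
		}
		break;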


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 17:13                               ` Ryan Roberts
@ 2024-03-08 18:09                                 ` Ryan Roberts
  2024-03-08 18:18                                   ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-08 18:09 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 08/03/2024 17:13, Ryan Roberts wrote:
> + DavidH
> 
> On 08/03/2024 16:03, Matthew Wilcox wrote:
>> On Fri, Mar 08, 2024 at 03:11:35PM +0000, Matthew Wilcox wrote:
>>> Actually, I have a clue!  The third and fourth word have the same value.
>>> That's indicative of an empty list_head.  And if this were LRU, that would
>>> be the second and third word.  And the PFN is congruent to 2 modulo 4.
>>> So this is the second tail page, and that's an empty deferred_list.
>>> So how do we init a list_head after a folio gets freed?
>>
>> We should probably add this patch anyway, because why wouldn't we want
>> to check this.  Maybe it'll catch your offender?
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 025ad1a7df7b..fc9c7ca24c4c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>>  		break;
>>  	case 2:
>>  		/*
>> -		 * the second tail page: ->mapping is
>> -		 * deferred_list.next -- ignore value.
>> +		 * the second tail page: ->mapping is deferred_list.next
>>  		 */
>> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
>> +			bad_page(page, "still on deferred list");
>> +			goto out;
>> +		}
>>  		break;
>>  	default:
>>  		if (page->mapping != TAIL_MAPPING) {
>>
>> (thinking about it, this may not be right for all tail pages; will Slab
>> stumble over this?  It doesn't seem to stumble on _entire_mapcount, but
>> then we always initialise _entire_mapcount for all compound pages
>> and we don't initialise _deferred_list for slab ... gah)
> 
> Yeah I'm getting a huge number of hits for this check. Most either have
> kfree() or free_slab() or page_to_skb() (networking code?) in the stack.
> Ideally I need to filter on anon pages only, but presumably we have already
> ditched that info? Actually it looks like the head page hasn't been nuked yet
> so I should be able to test the low bit of mapping... let me have a play.

I think the world is trying to tell me "it's Friday night. Stop". I can no longer
reproduce the non-NULL mapping oops that I was able to hit reliably this morning.

I do have this one though:

[  197.332914] Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000000
[  197.334250] Mem abort info:
[  197.334476]   ESR = 0x0000000096000044
[  197.334759]   EC = 0x25: DABT (current EL), IL = 32 bits
[  197.335161]   SET = 0, FnV = 0
[  197.335393]   EA = 0, S1PTW = 0
[  197.335622]   FSC = 0x04: level 0 translation fault
[  197.335985] Data abort info:
[  197.336201]   ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
[  197.336606]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[  197.336998]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  197.337424] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000215dc0000
[  197.337927] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  197.338585] Internal error: Oops: 0000000096000044 [#1] PREEMPT SMP
[  197.339058] Modules linked in:
[  197.339296] CPU: 61 PID: 2369 Comm: usemem Not tainted
6.8.0-rc5-00392-g827ce916aa61 #38
[  197.339920] Hardware name: linux,dummy-virt (DT)
[  197.340273] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  197.340790] pc : deferred_split_scan+0x210/0x260
[  197.341154] lr : deferred_split_scan+0x70/0x260
[  197.341792] sp : ffff80008b453770
[  197.342050] x29: ffff80008b453770 x28: 00000000000000f7 x27: ffff80008b453988
[  197.342618] x26: ffff0000c260e540 x25: 0000000000000080 x24: ffff800081f0fe38
[  197.343170] x23: 0000000000000000 x22: 00000000000000f8 x21: ffff80008b453988
[  197.343703] x20: ffff0000ca897bd8 x19: ffff0000ca897b98 x18: 0000000000000000
[  197.344245] x17: 0000000000000000 x16: 0000000000000000 x15: 00000000041557f9
[  197.344783] x14: 00000000041557f8 x13: 00000000041557f9 x12: 0000000000000000
[  197.345343] x11: 0000000000000040 x10: ffff800083cfed48 x9 : ffff80008b4537c0
[  197.345895] x8 : ffff800083cb2d10 x7 : 0000000000001b48 x6 : fffffc001a4a9090
[  197.346458] x5 : 0000000000000000 x4 : fffffc001a4a9090 x3 : fffffc001a4a9000
[  197.346994] x2 : fffffc001a4a9000 x1 : 0000000000000000 x0 : 0000000000000000
[  197.347534] Call trace:
[  197.347729]  deferred_split_scan+0x210/0x260
[  197.348069]  do_shrink_slab+0x184/0x750
[  197.348377]  shrink_slab+0x4d4/0x9c0
[  197.348646]  shrink_node+0x214/0x860
[  197.348923]  do_try_to_free_pages+0xd0/0x560
[  197.349257]  try_to_free_mem_cgroup_pages+0x14c/0x330
[  197.349641]  try_charge_memcg+0x1cc/0x788
[  197.349957]  __mem_cgroup_charge+0x6c/0xd0
[  197.350282]  __handle_mm_fault+0x1000/0x1a28
[  197.350624]  handle_mm_fault+0x7c/0x418
[  197.350933]  do_page_fault+0x100/0x690
[  197.351232]  do_translation_fault+0xb4/0xd0
[  197.351564]  do_mem_abort+0x4c/0xa8
[  197.351841]  el0_da+0x54/0xb8
[  197.352087]  el0t_64_sync_handler+0xe4/0x158
[  197.352432]  el0t_64_sync+0x190/0x198
[  197.352718] Code: 2a0503e6 35fff4a6 a9491446 f90004c5 (f90000a6)
[  197.353204] ---[ end trace 0000000000000000 ]---


deferred_split_scan+0x210/0x260 is the code that I added back:

if (!folio_try_get(folio)) {
	/* We lost race with folio_put() */
	list_del_init(&folio->_deferred_list); <<<< HERE
	ds_queue->split_queue_len--;
	continue;
}

We have the spinlock here so that really should not be happening. So does that
mean the list is being manipulated outside of the lock somewhere? Or maybe it's
the mapping (actually one of the deferred_list pointers) being cleared by the
buddy?
I dunno... give up. Will resume on Monday. Have a good weekend.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 18:09                                 ` Ryan Roberts
@ 2024-03-08 18:18                                   ` Matthew Wilcox
  2024-03-09  4:34                                     ` Andrew Morton
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-08 18:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Fri, Mar 08, 2024 at 06:09:25PM +0000, Ryan Roberts wrote:
> I think the world is trying to tell me "it's Friday night. Stop". I can no longer
> reproduce the non-NULL mapping oops that I was able to hit reliably this morning.

HEISENBUG!

> I do have this one though:
> 
> [  197.332914] Unable to handle kernel NULL pointer dereference at virtual
> address 0000000000000000
> [  197.340790] pc : deferred_split_scan+0x210/0x260
> [  197.341154] lr : deferred_split_scan+0x70/0x260
> [  197.347534] Call trace:
> [  197.347729]  deferred_split_scan+0x210/0x260
> [  197.348069]  do_shrink_slab+0x184/0x750
> 
> 
> deferred_split_scan+0x210/0x260 is the code that I added back:
> 
> if (!folio_try_get(folio)) {
> 	/* We lost race with folio_put() */
> 	list_del_init(&folio->_deferred_list); <<<< HERE
> 	ds_queue->split_queue_len--;
> 	continue;
> }
> 
> We have the spinlock here so that really should not be happening. So does that
> mean the list is being manipulated outside of the lock somewhere? Or maybe it's
> the mapping (actually one of the deferred_list pointers) being cleared by the
> buddy?
> I dunno... give up. Will resume on Monday. Have a good weekend.

This is actually congruent with a new theory I have which is that
somewhere/somehow we're freeing the page without taking it off the
deferred list.  I don't see such a path, but if it does exist, we could
absolutely corrupt the deferred_list in this way.  Just working on a
patch to make my detection patch reliable ...

You have a good weekend too!


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 18:18                                   ` Matthew Wilcox
@ 2024-03-09  4:34                                     ` Andrew Morton
  2024-03-09  4:52                                       ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2024-03-09  4:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, David Hildenbrand, Zi Yan, linux-mm, Yang Shi, Huang Ying


We seem to be coming down to the wire on this one - Linus might release
6.8 this weekend.

Will simply dropping "mm: allow non-hugetlb large folios to be batch
processed" from mm-stable get us out of trouble?

Thanks.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  4:34                                     ` Andrew Morton
@ 2024-03-09  4:52                                       ` Matthew Wilcox
  2024-03-09  8:05                                         ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-09  4:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ryan Roberts, David Hildenbrand, Zi Yan, linux-mm, Yang Shi, Huang Ying

On Fri, Mar 08, 2024 at 08:34:15PM -0800, Andrew Morton wrote:
> 
> We seem to be coming down to the wire on this one - Linus might release
> 6.8 this weekend.
> 
> Will simply dropping "mm: allow non-hugetlb large folios to be batch
> processed" from mm-stable get us out of trouble?

We can add a fix patch which re-narrows the race to the point where it's
no longer observable.  Obviously we need to figure out what the real
problem is, but we could be going back a long way.  We've definitely
found two bugs in the process of investigating the problem (of arguable
import; the migration one merely wastes memory temporarily and it's not
entirely clear that the wrong-lock problem definitely causes a crash)

diff --git a/mm/swap.c b/mm/swap.c
index 6b697d33fa5b..7b1d3144391b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			free_huge_folio(folio);
 			continue;
 		}
+		if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+			folio_undo_large_rmappable(folio);
 
 		__page_cache_release(folio, &lruvec, &flags);
 


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-08 11:44                     ` Ryan Roberts
  2024-03-08 12:09                       ` Ryan Roberts
@ 2024-03-09  6:09                       ` Matthew Wilcox
  2024-03-09  7:59                         ` Ryan Roberts
  1 sibling, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-09  6:09 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Fri, Mar 08, 2024 at 11:44:35AM +0000, Ryan Roberts wrote:
> > The thought occurs that we don't need to take the folios off the list.
> > I don't know that will fix anything, but this will fix your "running out
> > of memory" problem -- I forgot to drop the reference if folio_trylock()
> > failed.  Of course, I can't call folio_put() inside the lock, so may
> > as well move the trylock back to the second loop.

I think this was a bad thought ...

> Dumping all the CPU back traces with gdb, all the cores (except one) are
> contending on the deferred split lock.

I'm pretty sure that we can call the shrinker on multiple CPUs at the
same time (can you confirm from the backtrace?)

        struct pglist_data *pgdata = NODE_DATA(sc->nid);
        struct deferred_split *ds_queue = &pgdata->deferred_split_queue;

so if two CPUs try to shrink the same node, they're going to try to
process the same set of folios.  Which means the split will keep failing
because each of them will have a refcount on the folio, and ... yeah.

If so, we need to take the folios off the list (or otherwise mark them)
so that they can't be processed by more than one CPU at a time.  And
that leads me to this patch (yes, folio_prep_large_rmappable() is
now vestigial, but removing it increases the churn a bit much for this
stage of debugging)

This time I've boot-tested it.  I'm running my usual test-suite against
it now with little expectation that it will trigger.  If I have time
I'll try to recreate your setup.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..2ca033a6c3d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
 	if (!folio || !folio_test_large(folio))
 		return;
-	if (folio_order(folio) > 1)
-		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
 }
 
@@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,36 +3319,40 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/* Take pin on all head pages to avoid freeing them under us */
+	/* Take ref on all folios to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
+		list_del_init(&folio->_deferred_list);
+		sc->nr_to_scan--;
+		if (!folio_try_get(folio)) {
 			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
 			ds_queue->split_queue_len--;
+		} else if (folio_batch_add(&batch, folio) == 0) {
+			break;
 		}
-		if (!--sc->nr_to_scan)
+		if (!sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+			continue;
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	folios_put(&batch);
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
diff --git a/mm/internal.h b/mm/internal.h
index 1dfdc3bde1b0..14c21d06f233 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+	if (order > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 025ad1a7df7b..fc9c7ca24c4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
+		 * the second tail page: ->mapping is deferred_list.next
 		 */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "still on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {



^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  6:09                       ` Matthew Wilcox
@ 2024-03-09  7:59                         ` Ryan Roberts
  2024-03-09  8:18                           ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-09  7:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 09/03/2024 06:09, Matthew Wilcox wrote:
> On Fri, Mar 08, 2024 at 11:44:35AM +0000, Ryan Roberts wrote:
>>> The thought occurs that we don't need to take the folios off the list.
>>> I don't know that will fix anything, but this will fix your "running out
>>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>>> failed.  Of course, I can't call folio_put() inside the lock, so may
>>> as well move the trylock back to the second loop.
> 
> I think this was a bad thought ...

The not-taking-folios-off-the-list thought? Yes, agreed.

> 
>> Dumping all the CPU back traces with gdb, all the cores (except one) are
>> contending on the deferred split lock.
> 
> I'm pretty sure that we can call the shrinker on multiple CPUs at the
> same time (can you confirm from the backtrace?)

Yes, the vast majority of the CPUs were in deferred_split_scan() waiting for the
split_queue_lock.

> 
>         struct pglist_data *pgdata = NODE_DATA(sc->nid);
>         struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
> 
> so if two CPUs try to shrink the same node, they're going to try to
> process the same set of folios.  Which means the split will keep failing
> because each of them will have a refcount on the folio, and ... yeah.

Ahh, ouch. So this probably explains why things started going slow for me again
last night.

> 
> If so, we need to take the folios off the list (or otherwise mark them)
> so that they can't be processed by more than one CPU at a time.  And
> that leads me to this patch (yes, folio_prep_large_rmappable() is
> now vestigial, but removing it increases the churn a bit much for this
> stage of debugging)

Looks sensible on first review. I'll do some testing now to see if I can
re-trigger the non-NULL mapping issue. Will get back to you in the next couple of
hours.

> 
> This time I've boot-tested it.  I'm running my usual test-suite against
> it now with little expectation that it will trigger.  If I have time
> I'll try to recreate your setup.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd745bcc97ff..2ca033a6c3d8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
>  {
>  	if (!folio || !folio_test_large(folio))
>  		return;
> -	if (folio_order(folio) > 1)
> -		INIT_LIST_HEAD(&folio->_deferred_list);
>  	folio_set_large_rmappable(folio);
>  }
>  
> @@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct folio_batch batch;
>  	struct folio *folio, *next;
>  	int split = 0;
>  
> @@ -3321,36 +3319,40 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>  
> +	folio_batch_init(&batch);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	/* Take pin on all head pages to avoid freeing them under us */
> +	/* Take ref on all folios to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
> -		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> +		list_del_init(&folio->_deferred_list);
> +		sc->nr_to_scan--;
> +		if (!folio_try_get(folio)) {
>  			/* We lost race with folio_put() */
> -			list_del_init(&folio->_deferred_list);
>  			ds_queue->split_queue_len--;
> +		} else if (folio_batch_add(&batch, folio) == 0) {
> +			break;
>  		}
> -		if (!--sc->nr_to_scan)
> +		if (!sc->nr_to_scan)
>  			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>  		if (!folio_trylock(folio))
> -			goto next;
> -		/* split_huge_page() removes page from list on success */
> +			continue;
>  		if (!split_folio(folio))
>  			split++;
>  		folio_unlock(folio);
> -next:
> -		folio_put(folio);
>  	}
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
> +		if (!folio_test_large(folio))
> +			continue;
> +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> +	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +	folios_put(&batch);
>  
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
> diff --git a/mm/internal.h b/mm/internal.h
> index 1dfdc3bde1b0..14c21d06f233 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  	atomic_set(&folio->_entire_mapcount, -1);
>  	atomic_set(&folio->_nr_pages_mapped, 0);
>  	atomic_set(&folio->_pincount, 0);
> +	if (order > 1)
> +		INIT_LIST_HEAD(&folio->_deferred_list);
>  }
>  
>  static inline void prep_compound_tail(struct page *head, int tail_idx)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 025ad1a7df7b..fc9c7ca24c4c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		break;
>  	case 2:
>  		/*
> -		 * the second tail page: ->mapping is
> -		 * deferred_list.next -- ignore value.
> +		 * the second tail page: ->mapping is deferred_list.next
>  		 */
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "still on deferred list");
> +			goto out;
> +		}
>  		break;
>  	default:
>  		if (page->mapping != TAIL_MAPPING) {
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  4:52                                       ` Matthew Wilcox
@ 2024-03-09  8:05                                         ` Ryan Roberts
  2024-03-09 12:33                                           ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-09  8:05 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: David Hildenbrand, Zi Yan, linux-mm, Yang Shi, Huang Ying

On 09/03/2024 04:52, Matthew Wilcox wrote:
> On Fri, Mar 08, 2024 at 08:34:15PM -0800, Andrew Morton wrote:
>>
>> We seem to be coming down to the wire on this one - Linus might release
>> 6.8 this weekend.
>>
>> Will simply dropping "mm: allow non-hugetlb large folios to be batch
>> processed" from mm-stable get us out of trouble?
> 
> We can add a fix patch which re-narrows the race to the point where it's
> no longer observable.  Obviously we need to figure out what the real
> problem is, but we could be going back a long way.  We've definitely
> found two bugs in the process of investigating the problem (of arguable
> import; the migration one merely wastes memory temporarily and it's not
> entirely clear that the wrong-lock problem definitely causes a crash)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 6b697d33fa5b..7b1d3144391b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>  			free_huge_folio(folio);
>  			continue;
>  		}
> +		if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> +			folio_undo_large_rmappable(folio);
>  
>  		__page_cache_release(folio, &lruvec, &flags);
>  

I agree this is likely to re-hide the problems. But I haven't actually tested it
on its own without the other fixes. I'll do some more testing with your latest
patch and if that doesn't lead anywhere, I'll test with this on its own to check
that I can no longer reproduce the crashes. If it hides them, I think this is
the best short-term solution we have right now.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  7:59                         ` Ryan Roberts
@ 2024-03-09  8:18                           ` Ryan Roberts
  2024-03-09  9:38                             ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-09  8:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 09/03/2024 07:59, Ryan Roberts wrote:
> On 09/03/2024 06:09, Matthew Wilcox wrote:
>> On Fri, Mar 08, 2024 at 11:44:35AM +0000, Ryan Roberts wrote:
>>>> The thought occurs that we don't need to take the folios off the list.
>>>> I don't know that will fix anything, but this will fix your "running out
>>>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>>>> failed.  Of course, I can't call folio_put() inside the lock, so may
>>>> as well move the trylock back to the second loop.
>>
>> I think this was a bad thought ...
> 
> The not-taking-folios-off-the-list thought? Yes, agreed.
> 
>>
>>> Dumping all the CPU back traces with gdb, all the cores (except one) are
>>> contending on the deferred split lock.
>>
>> I'm pretty sure that we can call the shrinker on multiple CPUs at the
>> same time (can you confirm from the backtrace?)
> 
> Yes, the vast majority of the CPUs were in deferred_split_scan() waiting for the
> split_queue_lock.
> 
>>
>>         struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>         struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>
>> so if two CPUs try to shrink the same node, they're going to try to
>> process the same set of folios.  Which means the split will keep failing
>> because each of them will have a refcount on the folio, and ... yeah.
> 
> Ahh, ouch. So this probably explains why things started going slow for me again
> last night.
> 
>>
>> If so, we need to take the folios off the list (or otherwise mark them)
>> so that they can't be processed by more than one CPU at a time.  And
>> that leads me to this patch (yes, folio_prep_large_rmappable() is
>> now vestigial, but removing it increases the churn a bit much for this
>> stage of debugging)
> 
> Looks sensible on first review. I'll do some testing now to see if I can
>> re-trigger the non-NULL mapping issue. Will get back to you in the next couple of
> hours.
> 
>>
>> This time I've boot-tested it.  I'm running my usual test-suite against
>> it now with little expectation that it will trigger.  If I have time
>> I'll try to recreate your setup.
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fd745bcc97ff..2ca033a6c3d8 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
>>  {
>>  	if (!folio || !folio_test_large(folio))
>>  		return;
>> -	if (folio_order(folio) > 1)
>> -		INIT_LIST_HEAD(&folio->_deferred_list);
>>  	folio_set_large_rmappable(folio);
>>  }
>>  
>> @@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>  	unsigned long flags;
>> -	LIST_HEAD(list);
>> +	struct folio_batch batch;
>>  	struct folio *folio, *next;
>>  	int split = 0;
>>  
>> @@ -3321,36 +3319,40 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  		ds_queue = &sc->memcg->deferred_split_queue;
>>  #endif
>>  
>> +	folio_batch_init(&batch);
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	/* Take pin on all head pages to avoid freeing them under us */
>> +	/* Take ref on all folios to avoid freeing them under us */
>>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>  							_deferred_list) {
>> -		if (folio_try_get(folio)) {
>> -			list_move(&folio->_deferred_list, &list);
>> -		} else {
>> +		list_del_init(&folio->_deferred_list);
>> +		sc->nr_to_scan--;
>> +		if (!folio_try_get(folio)) {
>>  			/* We lost race with folio_put() */
>> -			list_del_init(&folio->_deferred_list);
>>  			ds_queue->split_queue_len--;

I think split_queue_len is getting out of sync with the number of items on the
queue? We only decrement it if we lost the race with folio_put(). But we are
unconditionally taking folios off the list here. So we are definitely out of
sync until we take the lock again below. But we only put folios back on the list
that failed to split. A successful split used to decrement this variable
(because the folio was on _a_ list). But now it doesn't. So we are always
mismatched after the first failed split?

I'll fix this up before I start testing.

>> +		} else if (folio_batch_add(&batch, folio) == 0) {
>> +			break;
>>  		}
>> -		if (!--sc->nr_to_scan)
>> +		if (!sc->nr_to_scan)
>>  			break;
>>  	}
>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  
>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>  		if (!folio_trylock(folio))
>> -			goto next;
>> -		/* split_huge_page() removes page from list on success */
>> +			continue;
>>  		if (!split_folio(folio))
>>  			split++;
>>  		folio_unlock(folio);
>> -next:
>> -		folio_put(folio);
>>  	}
>>  
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	list_splice_tail(&list, &ds_queue->split_queue);
>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>> +		if (!folio_test_large(folio))
>> +			continue;
>> +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>> +	}
>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> +	folios_put(&batch);
>>  
>>  	/*
>>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 1dfdc3bde1b0..14c21d06f233 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>  	atomic_set(&folio->_entire_mapcount, -1);
>>  	atomic_set(&folio->_nr_pages_mapped, 0);
>>  	atomic_set(&folio->_pincount, 0);
>> +	if (order > 1)
>> +		INIT_LIST_HEAD(&folio->_deferred_list);
>>  }
>>  
>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 025ad1a7df7b..fc9c7ca24c4c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>>  		break;
>>  	case 2:
>>  		/*
>> -		 * the second tail page: ->mapping is
>> -		 * deferred_list.next -- ignore value.
>> +		 * the second tail page: ->mapping is deferred_list.next
>>  		 */
>> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
>> +			bad_page(page, "still on deferred list");
>> +			goto out;
>> +		}
>>  		break;
>>  	default:
>>  		if (page->mapping != TAIL_MAPPING) {
>>
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  8:18                           ` Ryan Roberts
@ 2024-03-09  9:38                             ` Ryan Roberts
  2024-03-10  4:23                               ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-09  9:38 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

[...]
>>>
>>> If so, we need to take the folios off the list (or otherwise mark them)
>>> so that they can't be processed by more than one CPU at a time.  And
>>> that leads me to this patch (yes, folio_prep_large_rmappable() is
>>> now vestigial, but removing it increases the churn a bit much for this
>>> stage of debugging)
>>
>> Looks sensible on first review. I'll do some testing now to see if I can
>> re-triger the non-NULL mapping issue. Will get back to you in the next couple of
>> hours.
>>
[...]
>>> -		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> +		list_del_init(&folio->_deferred_list);
>>> +		sc->nr_to_scan--;
>>> +		if (!folio_try_get(folio)) {
>>>  			/* We lost race with folio_put() */
>>> -			list_del_init(&folio->_deferred_list);
>>>  			ds_queue->split_queue_len--;
> 
> I think split_queue_len is getting out of sync with the number of items on the
> queue? We only decrement it if we lost the race with folio_put(). But we are
> unconditionally taking folios off the list here. So we are definitely out of
> sync until we take the lock again below. But we only put folios back on the list
> that failed to split. A successful split used to decrement this variable
> (because the folio was on _a_ list). But now it doesn't. So we are always
> mismatched after the first failed split?

Oops, I meant first *successful* split.
> 
> I'll fix this up before I start testing.
> 

I've run the full test 5 times, and haven't seen any slowdown or RCU stall
warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
did not trigger):

[  944.475632] BUG: Bad page state in process usemem  pfn:252932
[  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
mapping:000000003a777cd9 index:0x1 pfn:0x252932
[  944.478575] aops:0x0 ino:dead000000000122
[  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[  944.479934] page_type: 0xffffffff()
[  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
fffffc00084a4c90
[  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
0000000000000000
[  944.482475] page dumped because: non-NULL mapping
[  944.482938] Modules linked in:
[  944.483238] CPU: 9 PID: 3805 Comm: usemem Not tainted
6.8.0-rc5-00391-g44b0dc848590 #39
[  944.484308] Hardware name: linux,dummy-virt (DT)
[  944.484981] Call trace:
[  944.485315]  dump_backtrace+0x9c/0x100
[  944.485840]  show_stack+0x20/0x38
[  944.486300]  dump_stack_lvl+0x90/0xb0
[  944.486833]  dump_stack+0x18/0x28
[  944.487178]  bad_page+0x88/0x128
[  944.487523]  __rmqueue_pcplist+0x24c/0xa18
[  944.488001]  get_page_from_freelist+0x278/0x1288
[  944.488579]  __alloc_pages+0xec/0x1018
[  944.489097]  alloc_pages_mpol+0x11c/0x278
[  944.489455]  vma_alloc_folio+0x70/0xd0
[  944.489988]  do_huge_pmd_anonymous_page+0xb8/0xda8
[  944.490819]  __handle_mm_fault+0xd00/0x1a28
[  944.491497]  handle_mm_fault+0x7c/0x418
[  944.491978]  do_page_fault+0x100/0x690
[  944.492357]  do_translation_fault+0xb4/0xd0
[  944.492968]  do_mem_abort+0x4c/0xa8
[  944.493353]  el0_da+0x54/0xb8
[  944.493818]  el0t_64_sync_handler+0xe4/0x158
[  944.494581]  el0t_64_sync+0x190/0x198
[  944.495218] Disabling lock debugging due to kernel taint


So what do we know?

 - the above page looks like it was the 3rd page of a large folio
    - words 3 and 4 are the same, meaning they are likely empty _deferred_list
    - pfn alignment is correct for this
 - The _deferred_list for all previously freed large folios was empty
    - but the folio could have been in the new deferred split batch?
 - free_tail_page_prepare() zeroed mapping/_deferred_list during free
 - _deferred_list was subsequently reinitialized to "empty" while on free list

So how about this for a rough hypothesis:


CPU1                                  CPU2
deferred_split_scan
list_del_init
folio_batch_add
                                      folio_put -> free
		                        free_tail_page_prepare
			                  is on deferred list? -> no
split_huge_page_to_list_to_order
  list_empty(folio->_deferred_list)
    -> yes
  list_del_init
			                  mapping = NULL
					    -> (_deferred_list.prev = NULL)
			                put page on free list
    INIT_LIST_HEAD(entry);
      -> "mapping" no longer NULL


But CPU1 is holding a reference, so that could only happen if a reference was
put one too many times. Ugh.




FYI, this is the fixed-up fix patch I have:


diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..fde016451e1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
        if (!folio || !folio_test_large(folio))
                return;
-       if (folio_order(folio) > 1)
-               INIT_LIST_HEAD(&folio->_deferred_list);
        folio_set_large_rmappable(folio);
 }

@@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker
*shrink,
        struct pglist_data *pgdata = NODE_DATA(sc->nid);
        struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
        unsigned long flags;
-       LIST_HEAD(list);
+       struct folio_batch batch;
        struct folio *folio, *next;
        int split = 0;

@@ -3321,36 +3319,39 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
                ds_queue = &sc->memcg->deferred_split_queue;
 #endif

+       folio_batch_init(&batch);
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-       /* Take pin on all head pages to avoid freeing them under us */
+       /* Take ref on all folios to avoid freeing them under us */
        list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
                                                        _deferred_list) {
-               if (folio_try_get(folio)) {
-                       list_move(&folio->_deferred_list, &list);
-               } else {
-                       /* We lost race with folio_put() */
-                       list_del_init(&folio->_deferred_list);
-                       ds_queue->split_queue_len--;
-               }
-               if (!--sc->nr_to_scan)
+               list_del_init(&folio->_deferred_list);
+               ds_queue->split_queue_len--;
+               sc->nr_to_scan--;
+               if (folio_try_get(folio) &&
+                   folio_batch_add(&batch, folio) == 0)
+                       break;
+               if (!sc->nr_to_scan)
                        break;
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

-       list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+       while ((folio = folio_batch_next(&batch)) != NULL) {
                if (!folio_trylock(folio))
-                       goto next;
-               /* split_huge_page() removes page from list on success */
+                       continue;
                if (!split_folio(folio))
                        split++;
                folio_unlock(folio);
-next:
-               folio_put(folio);
        }

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-       list_splice_tail(&list, &ds_queue->split_queue);
+       while ((folio = folio_batch_next(&batch)) != NULL) {
+               if (!folio_test_large(folio))
+                       continue;
+               list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+               ds_queue->split_queue_len++;
+       }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+       folios_put(&batch);

        /*
         * Stop shrinker if we didn't split any page, but the queue is empty.
diff --git a/mm/internal.h b/mm/internal.h
index 1dfdc3bde1b0..14c21d06f233 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
        atomic_set(&folio->_entire_mapcount, -1);
        atomic_set(&folio->_nr_pages_mapped, 0);
        atomic_set(&folio->_pincount, 0);
+       if (order > 1)
+               INIT_LIST_HEAD(&folio->_deferred_list);
 }

 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 025ad1a7df7b..fc9c7ca24c4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,9 +1012,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
                break;
        case 2:
                /*
-                * the second tail page: ->mapping is
-                * deferred_list.next -- ignore value.
+                * the second tail page: ->mapping is deferred_list.next
                 */
+               if (unlikely(!list_empty(&folio->_deferred_list))) {
+                       bad_page(page, "still on deferred list");
+                       goto out;
+               }
                break;
        default:
                if (page->mapping != TAIL_MAPPING) {




^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  8:05                                         ` Ryan Roberts
@ 2024-03-09 12:33                                           ` Ryan Roberts
  2024-03-10 13:38                                             ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-09 12:33 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: David Hildenbrand, Zi Yan, linux-mm, Yang Shi, Huang Ying

On 09/03/2024 08:05, Ryan Roberts wrote:
> On 09/03/2024 04:52, Matthew Wilcox wrote:
>> On Fri, Mar 08, 2024 at 08:34:15PM -0800, Andrew Morton wrote:
>>>
>>> We seem to be coming down to the wire on this one - Linus might release
>>> 6.8 this weekend.
>>>
>>> Will simply dropping "mm: allow non-hugetlb large folios to be batch
>>> processed" from mm-stable get us out of trouble?
>>
>> We can add a fix patch which re-narrows the race to the point where it's
>> no longer observable.  Obviously we need to figure out what the real
>> problem is, but we could be going back a long way.  We've definitely
>> found two bugs in the process of investigating the problem (of arguable
>> import; the migration one merely wastes memory temporarily and it's not
>> entirely clear that the wrong-lock problem definitely causes a crash)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 6b697d33fa5b..7b1d3144391b 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>>  			free_huge_folio(folio);
>>  			continue;
>>  		}
>> +		if (folio_test_large(folio) && folio_test_large_rmappable(folio))
>> +			folio_undo_large_rmappable(folio);
>>  
>>  		__page_cache_release(folio, &lruvec, &flags);
>>  
> 
> I agree this is likely to re-hide the problems. But I haven't actually tested it
> on its own without the other fixes. I'll do some more testing with your latest
> patch and if that doesn't lead anywhere, I'll test with this on its own to check
> that I can no longer reproduce the crashes. If it hides them, I think this is
> the best short-term solution we have right now.

I've tested this workaround immediately on top of commit f77171d241e3 ("mm:
allow non-hugetlb large folios to be batch processed") and can't reproduce any
problem. I've run the test 32 times. Without the workaround, the biggest number
of test repeats I've managed before seeing a problem is ~5. So I'm confident
this will be sufficient as a short-term solution.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09  9:38                             ` Ryan Roberts
@ 2024-03-10  4:23                               ` Matthew Wilcox
  2024-03-10  8:23                                 ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10  4:23 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
> > I think split_queue_len is getting out of sync with the number of items on the
> > queue? We only decrement it if we lost the race with folio_put(). But we are
> > unconditionally taking folios off the list here. So we are definitely out of
> > sync until we take the lock again below. But we only put folios back on the list
> > that failed to split. A successful split used to decrement this variable
> > (because the folio was on _a_ list). But now it doesn't. So we are always
> > mismatched after the first failed split?
> 
> Oops, I meant first *successful* split.

Agreed, nice fix.

> I've run the full test 5 times, and haven't seen any slow down or RCU stall
> warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
> did not trigger):
> 
> [  944.475632] BUG: Bad page state in process usemem  pfn:252932
> [  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
> mapping:000000003a777cd9 index:0x1 pfn:0x252932
> [  944.478575] aops:0x0 ino:dead000000000122
> [  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  944.479934] page_type: 0xffffffff()
> [  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
> fffffc00084a4c90
> [  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  944.482475] page dumped because: non-NULL mapping

> So what do we know?
> 
>  - the above page looks like it was the 3rd page of a large folio
>     - words 3 and 4 are the same, meaning they are likely empty _deferred_list
>     - pfn alignment is correct for this
>  - The _deferred_list for all previously freed large folios was empty
>     - but the folio could have been in the new deferred split batch?

I don't think it could be in a deferred split batch because we hold the
refcount at that point ...

>  - free_tail_page_prepare() zeroed mapping/_deferred_list during free
>  - _deferred_list was subsequently reinitialized to "empty" while on free list
> 
> So how about this for a rough hypothesis:
> 
> 
> CPU1                                  CPU2
> deferred_split_scan
> list_del_init
> folio_batch_add
>                                       folio_put -> free
> 		                        free_tail_page_prepare
> 			                  is on deferred list? -> no
> split_huge_page_to_list_to_order
>   list_empty(folio->_deferred_list)
>     -> yes
>   list_del_init
> 			                  mapping = NULL
> 					    -> (_deferred_list.prev = NULL)
> 			                put page on free list
>     INIT_LIST_HEAD(entry);
>       -> "mapping" no longer NULL
> 
> 
> But CPU1 is holding a reference, so that could only happen if a reference was
> put one too many times. Ugh.

Before we start blaming the CPU for doing something impossible, what if
we're taking the wrong lock?  I know that seems crazy, but if page->flags
gets corrupted to the point where we change some of the bits in the
nid, when we free the folio, we call folio_undo_large_rmappable(),
get the wrong ds_queue back from get_deferred_split_queue(), take the
wrong split_queue_lock, corrupt the deferred list of a different node,
and bad things happen?

I don't think we can detect that folio->nid has become corrupted in the
page allocation/freeing code (can we?), but we can tell if a folio is
on the wrong ds_queue in deferred_split_scan():

        list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
                                                        _deferred_list) {
+		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
+		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
                list_del_init(&folio->_deferred_list);

(also testing the hypothesis that somehow a split folio has ended up
on the deferred split list)

This wouldn't catch the splat above early, I don't think, but it might
trigger early enough with your workload that it'd be useful information.

(I reviewed the patch you're currently testing with and it matches with
what I think we should be doing)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10  4:23                               ` Matthew Wilcox
@ 2024-03-10  8:23                                 ` Ryan Roberts
  2024-03-10 11:08                                   ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-10  8:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 10/03/2024 04:23, Matthew Wilcox wrote:
> On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
>>> I think split_queue_len is getting out of sync with the number of items on the
>>> queue? We only decrement it if we lost the race with folio_put(). But we are
>>> unconditionally taking folios off the list here. So we are definitely out of
>>> sync until we take the lock again below. But we only put folios back on the list
>>> that failed to split. A successful split used to decrement this variable
>>> (because the folio was on _a_ list). But now it doesn't. So we are always
>>> mismatched after the first failed split?
>>
>> Oops, I meant first *successful* split.
> 
> Agreed, nice fix.
> 
>> I've run the full test 5 times, and haven't seen any slow down or RCU stall
>> warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
>> did not trigger):
>>
>> [  944.475632] BUG: Bad page state in process usemem  pfn:252932
>> [  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
>> mapping:000000003a777cd9 index:0x1 pfn:0x252932
>> [  944.478575] aops:0x0 ino:dead000000000122
>> [  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
>> [  944.479934] page_type: 0xffffffff()
>> [  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
>> fffffc00084a4c90
>> [  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
>> 0000000000000000
>> [  944.482475] page dumped because: non-NULL mapping
> 
>> So what do we know?
>>
>>  - the above page looks like it was the 3rd page of a large folio
>>     - words 3 and 4 are the same, meaning they are likely empty _deferred_list
>>     - pfn alignment is correct for this
>>  - The _deferred_list for all previously freed large folios was empty
>>     - but the folio could have been in the new deferred split batch?
> 
> I don't think it could be in a deferred split batch because we hold the
> refcount at that point ...
> 
>>  - free_tail_page_prepare() zeroed mapping/_deferred_list during free
>>  - _deferred_list was subsequently reinitialized to "empty" while on free list
>>
>> So how about this for a rough hypothesis:
>>
>>
>> CPU1                                  CPU2
>> deferred_split_scan
>> list_del_init
>> folio_batch_add
>>                                       folio_put -> free
>> 		                        free_tail_page_prepare
>> 			                  is on deferred list? -> no
>> split_huge_page_to_list_to_order
>>   list_empty(folio->_deferred_list)
>>     -> yes
>>   list_del_init
>> 			                  mapping = NULL
>> 					    -> (_deferred_list.prev = NULL)
>> 			                put page on free list
>>     INIT_LIST_HEAD(entry);
>>       -> "mapping" no longer NULL
>>
>>
>> But CPU1 is holding a reference, so that could only happen if a reference was
>> put one too many times. Ugh.
> 
> Before we start blaming the CPU for doing something impossible, 

It doesn't sound completely impossible to me that there is a rare error path that accidentally folio_put()s an extra time...

> what if
> we're taking the wrong lock?  

...but yeah, equally as plausible, I guess.

> I know that seems crazy, but if page->flags
> gets corrupted to the point where we change some of the bits in the
> nid, when we free the folio, we call folio_undo_large_rmappable(),
> get the wrong ds_queue back from get_deferred_split_queue(), take the
> wrong split_queue_lock, corrupt the deferred list of a different node,
> and bad things happen?
> 
> I don't think we can detect that folio->nid has become corrupted in the
> page allocation/freeing code (can we?), but we can tell if a folio is
> on the wrong ds_queue in deferred_split_scan():
> 
>         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>                                                         _deferred_list) {
> +		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
> +		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
>                 list_del_init(&folio->_deferred_list);
> 
> (also testing the hypothesis that somehow a split folio has ended up
> on the deferred split list)

OK, ran with these checks, and get the following oops:

[  411.719461] page:0000000059c1826b refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x8c6a40
[  411.720807] page:0000000059c1826b refcount:0 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0x8c6a40
[  411.721792] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[  411.722453] page_type: 0xffffff7f(buddy)
[  411.722870] raw: 0bfffc0000000000 fffffc001227e808 fffffc002a857408 0000000000000000
[  411.723672] raw: 0000000000000001 0000000000000004 00000000ffffff7f 0000000000000000
[  411.724470] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
[  411.725176] ------------[ cut here ]------------
[  411.725642] kernel BUG at include/linux/mm.h:1191!
[  411.726341] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[  411.727021] Modules linked in:
[  411.727329] CPU: 40 PID: 2704 Comm: usemem Not tainted 6.8.0-rc5-00391-g44b0dc848590-dirty #45
[  411.728179] Hardware name: linux,dummy-virt (DT)
[  411.728657] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  411.729381] pc : __dump_page+0x450/0x4a8
[  411.729789] lr : __dump_page+0x450/0x4a8
[  411.730187] sp : ffff80008b97b6f0
[  411.730525] x29: ffff80008b97b6f0 x28: 00000000000000e2 x27: ffff80008b97b988
[  411.731227] x26: ffff80008b97b988 x25: ffff800082105000 x24: 0000000000000001
[  411.731926] x23: 0000000000000000 x22: 0000000000000001 x21: fffffc00221a9000
[  411.732630] x20: fffffc00221a9000 x19: fffffc00221a9000 x18: ffffffffffffffff
[  411.733331] x17: 3030303030303030 x16: 2066376666666666 x15: 076c076f07660721
[  411.734035] x14: 0728074f0749074c x13: 076c076f07660721 x12: 0000000000000000
[  411.734757] x11: 0720072007200729 x10: ffff0013f5e756c0 x9 : ffff80008014b604
[  411.735473] x8 : 00000000ffffbfff x7 : ffff0013f5e756c0 x6 : 0000000000000000
[  411.736198] x5 : ffff0013a5a24d88 x4 : 0000000000000000 x3 : 0000000000000000
[  411.736923] x2 : 0000000000000000 x1 : ffff0000c2849b80 x0 : 000000000000003e
[  411.737621] Call trace:
[  411.737870]  __dump_page+0x450/0x4a8
[  411.738229]  dump_page+0x2c/0x70
[  411.738551]  deferred_split_scan+0x258/0x368
[  411.738973]  do_shrink_slab+0x184/0x750
[  411.739355]  shrink_slab+0x4d4/0x9c0
[  411.739729]  shrink_node+0x214/0x860
[  411.740098]  do_try_to_free_pages+0xd0/0x560
[  411.740540]  try_to_free_mem_cgroup_pages+0x14c/0x330
[  411.741048]  try_charge_memcg+0x1cc/0x788
[  411.741456]  __mem_cgroup_charge+0x6c/0xd0
[  411.741884]  __handle_mm_fault+0x1000/0x1a28
[  411.742306]  handle_mm_fault+0x7c/0x418
[  411.742698]  do_page_fault+0x100/0x690
[  411.743080]  do_translation_fault+0xb4/0xd0
[  411.743508]  do_mem_abort+0x4c/0xa8
[  411.743876]  el0_da+0x54/0xb8
[  411.744177]  el0t_64_sync_handler+0xe4/0x158
[  411.744602]  el0t_64_sync+0x190/0x198
[  411.744976] Code: f000de00 912c4021 9126a000 97f79727 (d4210000) 
[  411.745573] ---[ end trace 0000000000000000 ]---

The new VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); is firing, but then when dump_page() does this:

	if (compound) {
		pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
				head, compound_order(head),
				folio_entire_mapcount(folio),
				folio_nr_pages_mapped(folio),
				atomic_read(&folio->_pincount));
	}

VM_BUG_ON_FOLIO(!folio_test_large(folio), folio); inside folio_entire_mapcount() fires so we have a nested oops.

So the very first line is from the first oops and the rest is from the second. I guess we are racing with the page being freed? I find the change in mapcount interesting; 0 -> -128. Not sure why this would happen?

Given the NID check didn't fire, I wonder if this points more towards extra folio_put than corrupt folio nid?



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-06 16:09     ` Matthew Wilcox
  2024-03-06 16:19       ` Ryan Roberts
@ 2024-03-10 11:01       ` Ryan Roberts
  2024-03-10 11:11         ` Matthew Wilcox
  2024-03-10 11:14         ` Ryan Roberts
  1 sibling, 2 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-10 11:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 06/03/2024 16:09, Matthew Wilcox wrote:
> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
>> When running some swap tests with this change (which is in mm-stable)
>> present, I see BadThings(TM). Usually I see a "bad page state"
>> followed by a delay of a few seconds, followed by an oops or NULL
>> pointer deref. Bisect points to this change, and if I revert it,
>> the problem goes away.
> 
>> That oops is really messed up ;-(  We've clearly got two CPUs oopsing at
> the same time and it's all interleaved.  That said, I can pick some
> nuggets out of it.
> 
>> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
>> [   76.240196] kernel BUG at include/linux/mm.h:1120!
> 
> These are the two different BUGs being called simultaneously ...
> 
> The first one is bad_page() in page_alloc.c and the second is
> put_page_testzero()
>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> 
> I'm sure it's significant that both of these are the same page (pfn
> 2554a0).  Feels like we have two CPUs calling put_folio() at the same
> time, and one of them underflows.  It probably doesn't matter which call
> trace ends up in bad_page() and which in put_page_testzero().
> 
> One of them is coming from deferred_split_scan(), which is weird because
> we can see the folio_try_get() earlier in the function.  So whatever
> this folio was, we found it on the deferred split list, got its refcount,
> moved it to the local list, either failed to get the lock, or
> successfully got the lock, split it, unlocked it and put it.
> 
> (I can see this was invoked from page fault -> memcg shrinking.  That's
> probably irrelevant but explains some of the functions in the backtrace)
> 
> The other call trace comes from migrate_folio_done() where we're putting
> the _source_ folio.  That was called from migrate_pages_batch() which
> was called from kcompactd.
> 
> Um.  Where do we handle the deferred list in the migration code?
> 
> 
> I've also tried looking at this from a different angle -- what is it
> about this commit that produces this problem?  It's a fairly small
> commit:
> 
> -               if (folio_test_large(folio)) {
> +               /* hugetlb has its own memcg */
> +               if (folio_test_hugetlb(folio)) {
>                         if (lruvec) {
>                                 unlock_page_lruvec_irqrestore(lruvec, flags);
>                                 lruvec = NULL;
>                         }
> -                       __folio_put_large(folio);
> +                       free_huge_folio(folio);
> 
> So all that's changed is that large non-hugetlb folios do not call
> __folio_put_large().  As a reminder, that function does:
> 
>         if (!folio_test_hugetlb(folio))
>                 page_cache_release(folio);
>         destroy_large_folio(folio);
> 
> and destroy_large_folio() does:
>         if (folio_test_large_rmappable(folio))
>                 folio_undo_large_rmappable(folio);
> 
>         mem_cgroup_uncharge(folio);
>         free_the_page(&folio->page, folio_order(folio));
> 
> So after my patch, instead of calling (in order):
> 
> 	page_cache_release(folio);
> 	folio_undo_large_rmappable(folio);
> 	mem_cgroup_uncharge(folio);
> 	free_unref_page()
> 
> it calls:
> 
> 	__page_cache_release(folio, &lruvec, &flags);
> 	mem_cgroup_uncharge_folios()
> 	folio_undo_large_rmappable(folio);

I was just looking at this again, and something pops out...

You have swapped the order of folio_undo_large_rmappable() and
mem_cgroup_uncharge(). But folio_undo_large_rmappable() calls
get_deferred_split_queue() which tries to get the split queue from
folio_memcg(folio) first and falls back to pgdat otherwise. If you are now
calling mem_cgroup_uncharge_folios() first, will that remove the folio from the
cgroup? Then we are operating on the wrong list? (just a guess based on the name
of the function...)
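
For reference, get_deferred_split_queue() looks roughly like this (a sketch
from memory of the CONFIG_MEMCG variant in mm/huge_memory.c, so treat the
exact shape as approximate):

static struct deferred_split *get_deferred_split_queue(struct folio *folio)
{
        struct mem_cgroup *memcg = folio_memcg(folio);
        struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));

        /*
         * Once the folio has been uncharged, folio_memcg() returns NULL and
         * we silently fall back to the node queue - a different lock and a
         * different list from the one the folio is actually linked on.
         */
        if (memcg)
                return &memcg->deferred_split_queue;
        else
                return &pgdat->deferred_split_queue;
}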



> 
> So have I simply widened the window for this race, whatever it is
> exactly?  Something involving mis-handling of the deferred list?
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10  8:23                                 ` Ryan Roberts
@ 2024-03-10 11:08                                   ` Matthew Wilcox
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10 11:08 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Sun, Mar 10, 2024 at 08:23:12AM +0000, Ryan Roberts wrote:
> It doesn't sound completely impossible to me that there is a rare error path that accidentally folio_put()s an extra time...

Your debug below seems to prove that it's an extra folio_put()
somewhere.

> >         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
> >                                                         _deferred_list) {
> > +		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
> > +		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
> >                 list_del_init(&folio->_deferred_list);
> > 
> > (also testing the hypothesis that somehow a split folio has ended up
> > on the deferred split list)
> 
> OK, ran with these checks, and get the following oops:
> 
> [  411.719461] page:0000000059c1826b refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x8c6a40
> [  411.720807] page:0000000059c1826b refcount:0 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0x8c6a40
> [  411.721792] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  411.722453] page_type: 0xffffff7f(buddy)
> [  411.722870] raw: 0bfffc0000000000 fffffc001227e808 fffffc002a857408 0000000000000000
> [  411.723672] raw: 0000000000000001 0000000000000004 00000000ffffff7f 0000000000000000
> [  411.724470] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
> [  411.725176] ------------[ cut here ]------------
> [  411.725642] kernel BUG at include/linux/mm.h:1191!
> [  411.726341] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [  411.727021] Modules linked in:
> [  411.727329] CPU: 40 PID: 2704 Comm: usemem Not tainted 6.8.0-rc5-00391-g44b0dc848590-dirty #45
> [  411.728179] Hardware name: linux,dummy-virt (DT)
> [  411.728657] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  411.729381] pc : __dump_page+0x450/0x4a8
> [  411.729789] lr : __dump_page+0x450/0x4a8
> [  411.730187] sp : ffff80008b97b6f0
> [  411.730525] x29: ffff80008b97b6f0 x28: 00000000000000e2 x27: ffff80008b97b988
> [  411.731227] x26: ffff80008b97b988 x25: ffff800082105000 x24: 0000000000000001
> [  411.731926] x23: 0000000000000000 x22: 0000000000000001 x21: fffffc00221a9000
> [  411.732630] x20: fffffc00221a9000 x19: fffffc00221a9000 x18: ffffffffffffffff
> [  411.733331] x17: 3030303030303030 x16: 2066376666666666 x15: 076c076f07660721
> [  411.734035] x14: 0728074f0749074c x13: 076c076f07660721 x12: 0000000000000000
> [  411.734757] x11: 0720072007200729 x10: ffff0013f5e756c0 x9 : ffff80008014b604
> [  411.735473] x8 : 00000000ffffbfff x7 : ffff0013f5e756c0 x6 : 0000000000000000
> [  411.736198] x5 : ffff0013a5a24d88 x4 : 0000000000000000 x3 : 0000000000000000
> [  411.736923] x2 : 0000000000000000 x1 : ffff0000c2849b80 x0 : 000000000000003e
> [  411.737621] Call trace:
> [  411.737870]  __dump_page+0x450/0x4a8
> [  411.738229]  dump_page+0x2c/0x70
> [  411.738551]  deferred_split_scan+0x258/0x368
> [  411.738973]  do_shrink_slab+0x184/0x750
> 
> The new VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); is firing, but then when dump_page() does this:
> 
> 	if (compound) {
> 		pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
> 				head, compound_order(head),
> 				folio_entire_mapcount(folio),
> 				folio_nr_pages_mapped(folio),
> 				atomic_read(&folio->_pincount));
> 	}
> 
> VM_BUG_ON_FOLIO(!folio_test_large(folio), folio); inside folio_entire_mapcount() fires so we have a nested oops.

Ah.  I'm not sure what 44b0dc848590 is -- probably a local commit, but
I guess you don't have fae7d834c43c in it which would prevent the nested
oops.  Nevertheless, the nested oops does tell us something interesting.

> So the very first line is from the first oops and the rest is from the second. I guess we are racing with the page being freed? I find the change in mapcount interesting; 0 -> -128. Not sure why this would happen?

That's PG_buddy being set in PageType.

> Given the NID check didn't fire, I wonder if this points more towards extra folio_put than corrupt folio nid?

Must be if PG_buddy got set.  But we're still left with the question of
how the page gets freed while still being on the deferred list and
doesn't trigger bad_page(page, "still on deferred list") ...

Anyway, we've made some progress.  We now understand how a freed page
gets its deferred list overwritten -- we've found a split page on the
deferred list with refcount 0, we _assumed_ it was still intact and
overwrote a different page's ->mapping.  And it makes sense that my
patch opened the window wider to hit this problem.

I just checked that free_unref_folios() still does the right thing, and
that also relies on the page not yet being split:

                unsigned int order = folio_order(folio);

                if (order > 0 && folio_test_large_rmappable(folio))
                        folio_undo_large_rmappable(folio);
                if (!free_unref_page_prepare(&folio->page, pfn, order))
                        continue;

so there shouldn't be a point in the page freeing process where the
folio is split before we take it off the deferred list.

split_huge_page_to_list_to_order() is also very careful to take the
ds_queue->split_queue_lock before freezing the folio ref, so it's
not a race with that.  I don't see what it is yet.
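
For context, the ordering relied on there looks roughly like this (a sketch,
not the real function; variable names like extra_pins are approximate):

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (folio_ref_freeze(folio, 1 + extra_pins)) {
                /*
                 * Frozen while holding the lock, so deferred_split_scan()
                 * cannot observe a half-split folio via the queue.
                 */
                if (!list_empty(&folio->_deferred_list)) {
                        ds_queue->split_queue_len--;
                        list_del_init(&folio->_deferred_list);
                }
                spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
                /* the actual split happens after dropping the lock */
        } else {
                /* somebody else holds a reference; fail the split */
                spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
        }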


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 11:01       ` Ryan Roberts
@ 2024-03-10 11:11         ` Matthew Wilcox
  2024-03-10 16:31           ` Ryan Roberts
  2024-03-10 11:14         ` Ryan Roberts
  1 sibling, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10 11:11 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Sun, Mar 10, 2024 at 11:01:06AM +0000, Ryan Roberts wrote:
> > So after my patch, instead of calling (in order):
> > 
> > 	page_cache_release(folio);
> > 	folio_undo_large_rmappable(folio);
> > 	mem_cgroup_uncharge(folio);
> > 	free_unref_page()
> > 
> > it calls:
> > 
> > 	__page_cache_release(folio, &lruvec, &flags);
> > 	mem_cgroup_uncharge_folios()
> > 	folio_undo_large_rmappable(folio);
> 
> I was just looking at this again, and something pops out...
> 
> You have swapped the order of folio_undo_large_rmappable() and
> mem_cgroup_uncharge(). But folio_undo_large_rmappable() calls
> get_deferred_split_queue() which tries to get the split queue from
> folio_memcg(folio) first and falls back to pgdat otherwise. If you are now
> calling mem_cgroup_uncharge_folios() first, will that remove the folio from the
> cgroup? Then we are operating on the wrong list? (just a guess based on the name
> of the function...)

Oh my.  You've got it.  This explains everything.  Thank you!


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 11:01       ` Ryan Roberts
  2024-03-10 11:11         ` Matthew Wilcox
@ 2024-03-10 11:14         ` Ryan Roberts
  1 sibling, 0 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-10 11:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 10/03/2024 11:01, Ryan Roberts wrote:
> On 06/03/2024 16:09, Matthew Wilcox wrote:
>> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
>>> When running some swap tests with this change (which is in mm-stable)
>>> present, I see BadThings(TM). Usually I see a "bad page state"
>>> followed by a delay of a few seconds, followed by an oops or NULL
>>> pointer deref. Bisect points to this change, and if I revert it,
>>> the problem goes away.
>>
>> That oops is really messed up ;-(  We've clearly got two CPUs oopsing at
>> the same time and it's all interleaved.  That said, I can pick some
>> nuggets out of it.
>>
>>> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
>>> [   76.240196] kernel BUG at include/linux/mm.h:1120!
>>
>> These are the two different BUGs being called simultaneously ...
>>
>> The first one is bad_page() in page_alloc.c and the second is
>> put_page_testzero()
>>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>
>> I'm sure it's significant that both of these are the same page (pfn
>> 2554a0).  Feels like we have two CPUs calling put_folio() at the same
>> time, and one of them underflows.  It probably doesn't matter which call
>> trace ends up in bad_page() and which in put_page_testzero().
>>
>> One of them is coming from deferred_split_scan(), which is weird because
>> we can see the folio_try_get() earlier in the function.  So whatever
>> this folio was, we found it on the deferred split list, got its refcount,
>> moved it to the local list, either failed to get the lock, or
>> successfully got the lock, split it, unlocked it and put it.
>>
>> (I can see this was invoked from page fault -> memcg shrinking.  That's
>> probably irrelevant but explains some of the functions in the backtrace)
>>
>> The other call trace comes from migrate_folio_done() where we're putting
>> the _source_ folio.  That was called from migrate_pages_batch() which
>> was called from kcompactd.
>>
>> Um.  Where do we handle the deferred list in the migration code?
>>
>>
>> I've also tried looking at this from a different angle -- what is it
>> about this commit that produces this problem?  It's a fairly small
>> commit:
>>
>> -               if (folio_test_large(folio)) {
>> +               /* hugetlb has its own memcg */
>> +               if (folio_test_hugetlb(folio)) {
>>                         if (lruvec) {
>>                                 unlock_page_lruvec_irqrestore(lruvec, flags);
>>                                 lruvec = NULL;
>>                         }
>> -                       __folio_put_large(folio);
>> +                       free_huge_folio(folio);
>>
>> So all that's changed is that large non-hugetlb folios do not call
>> __folio_put_large().  As a reminder, that function does:
>>
>>         if (!folio_test_hugetlb(folio))
>>                 page_cache_release(folio);
>>         destroy_large_folio(folio);
>>
>> and destroy_large_folio() does:
>>         if (folio_test_large_rmappable(folio))
>>                 folio_undo_large_rmappable(folio);
>>
>>         mem_cgroup_uncharge(folio);
>>         free_the_page(&folio->page, folio_order(folio));
>>
>> So after my patch, instead of calling (in order):
>>
>> 	page_cache_release(folio);
>> 	folio_undo_large_rmappable(folio);
>> 	mem_cgroup_uncharge(folio);
>> 	free_unref_page()
>>
>> it calls:
>>
>> 	__page_cache_release(folio, &lruvec, &flags);
>> 	mem_cgroup_uncharge_folios()
>> 	folio_undo_large_rmappable(folio);
> 
> I was just looking at this again, and something pops out...
> 
> You have swapped the order of folio_undo_large_rmappable() and
> mem_cgroup_uncharge(). But folio_undo_large_rmappable() calls
> get_deferred_split_queue() which tries to get the split queue from
> folio_memcg(folio) first and falls back to pgdat otherwise. If you are now
> calling mem_cgroup_uncharge_folios() first, will that remove the folio from the
> cgroup? Then we are operating on the wrong list? (just a guess based on the name
> of the function...)

In fact, looking at mem_cgroup_uncharge_folios(), that's exactly what it does - it
calls uncharge_folio(), which zeros memcg_data. And this is completely
consistent with the behaviour I've seen, including the original bisection
result. And it explains why the "workaround to re-narrow" the window is 100%
successful - it's reverting the ordering to be correct again.
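
The relevant part of uncharge_folio() is essentially this (paraphrased from
memory, everything else elided):

static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
{
        /* ... gather the stats to flush later ... */

        /*
         * After this, folio_memcg() returns NULL, so a later
         * get_deferred_split_queue() picks the pgdat queue rather than the
         * memcg queue the folio is still linked on.
         */
        folio->memcg_data = 0;
}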

Assuming you agree, I'll leave you to work up the patch(s).

> 
> 
> 
>>
>> So have I simply widened the window for this race, whatever it is
>> exactly?  Something involving mis-handling of the deferred list?
>>
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-09 12:33                                           ` Ryan Roberts
@ 2024-03-10 13:38                                             ` Matthew Wilcox
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10 13:38 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, linux-mm, Yang Shi, Huang Ying

On Sat, Mar 09, 2024 at 12:33:44PM +0000, Ryan Roberts wrote:
> >> diff --git a/mm/swap.c b/mm/swap.c
> >> index 6b697d33fa5b..7b1d3144391b 100644
> >> --- a/mm/swap.c
> >> +++ b/mm/swap.c
> >> @@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> >>  			free_huge_folio(folio);
> >>  			continue;
> >>  		}
> >> +		if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> >> +			folio_undo_large_rmappable(folio);
> >>  
> >>  		__page_cache_release(folio, &lruvec, &flags);
> >>  
> > 
> > I agree this is likely to re-hide the problems. But I haven't actually tested it
> > on its own without the other fixes. I'll do some more testing with your latest
> > patch and if that doesn't lead anywhere, I'll test with this on its own to check
> > that I can no longer reproduce the crashes. If it hides them, I think this is
> > the best short-term solution we have right now.
> 
> I've tested this workaround immediately on top of commit f77171d241e3 ("mm:
> allow non-hugetlb large folios to be batch processed") and can't reproduce any
> problem. I've run the test 32 times. Without the workaround, the biggest number
> of test repeats I've managed before seeing a problem is ~5. So I'm confident
> this will be sufficient as a short-term solution.

Let's make this the 6.9 solution.  Having looked at a lot of bits of
the page freeing path, I'll do some significant rearranging for 6.10,
but this is tested and time is short.  Proper patch shortly.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 11:11         ` Matthew Wilcox
@ 2024-03-10 16:31           ` Ryan Roberts
  2024-03-10 19:57             ` Matthew Wilcox
  2024-03-10 19:59             ` Ryan Roberts
  0 siblings, 2 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-10 16:31 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 10/03/2024 11:11, Matthew Wilcox wrote:
> On Sun, Mar 10, 2024 at 11:01:06AM +0000, Ryan Roberts wrote:
>>> So after my patch, instead of calling (in order):
>>>
>>> 	page_cache_release(folio);
>>> 	folio_undo_large_rmappable(folio);
>>> 	mem_cgroup_uncharge(folio);
>>> 	free_unref_page()
>>>
>>> it calls:
>>>
>>> 	__page_cache_release(folio, &lruvec, &flags);
>>> 	mem_cgroup_uncharge_folios()
>>> 	folio_undo_large_rmappable(folio);
>>
>> I was just looking at this again, and something pops out...
>>
>> You have swapped the order of folio_undo_large_rmappable() and
>> mem_cgroup_uncharge(). But folio_undo_large_rmappable() calls
>> get_deferred_split_queue() which tries to get the split queue from
>> folio_memcg(folio) first and falls back to pgdat otherwise. If you are now
>> calling mem_cgroup_uncharge_folios() first, will that remove the folio from the
>> cgroup? Then we are operating on the wrong list? (just a guess based on the name
>> of the function...)
> 
> Oh my.  You've got it.  This explains everything.  Thank you!

I've just taken today's mm-unstable, added your official patch to fix the ordering and applied my large folio swap-out series on top (v4, which I haven't posted yet). In testing that, I'm seeing another oops :-( 

That's exactly how I discovered the original problem, and was hoping that with your fix, this would unblock me. Given I can only repro this when my changes are on top, I guess my code is most likely buggy, but perhaps you can take a quick look at the oops and tell me what you think?

[   96.372503] BUG: Bad page state in process usemem  pfn:be502
[   96.373336] page: refcount:0 mapcount:0 mapping:000000005abfa8d5 index:0x0 pfn:0xbe502
[   96.374341] aops:0x0 ino:fffffc0001f940c8
[   96.374893] flags: 0x7fff8000000000(node=0|zone=0|lastcpupid=0xffff)
[   96.375653] page_type: 0xffffffff()
[   96.376071] raw: 007fff8000000000 0000000000000000 fffffc0001f94090 ffff0000c99ee860
[   96.377055] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   96.378650] page dumped because: non-NULL mapping
[   96.379828] Modules linked in: binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore ip_tables x_tables autofs4 xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 crct10dif_ce ghash_ce sha2_ce virtio_net sha256_arm64 net_failover sha1_ce virtio_blk failover virtio_scsi virtio_rng aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[   96.386802] CPU: 13 PID: 4713 Comm: usemem Not tainted 6.8.0-rc5-ryarob01-swap-out-v4 #2
[   96.387691] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[   96.388887] Call trace:
[   96.389348]  dump_backtrace+0x9c/0x128
[   96.390213]  show_stack+0x20/0x38
[   96.390688]  dump_stack_lvl+0x78/0xc8
[   96.391163]  dump_stack+0x18/0x28
[   96.391545]  bad_page+0x88/0x128
[   96.391893]  get_page_from_freelist+0xa94/0x1bc0
[   96.392407]  __alloc_pages+0x194/0x10b0
[   96.392833]  alloc_pages_mpol+0x98/0x278
[   96.393278]  vma_alloc_folio+0x74/0xd8
[   96.393674]  __handle_mm_fault+0x7ac/0x1470
[   96.394146]  handle_mm_fault+0x70/0x2c8
[   96.394575]  do_page_fault+0x100/0x530
[   96.395013]  do_translation_fault+0xa4/0xd0
[   96.395476]  do_mem_abort+0x4c/0xa8
[   96.395869]  el0_da+0x30/0xa8
[   96.396229]  el0t_64_sync_handler+0xb4/0x130
[   96.396735]  el0t_64_sync+0x1a8/0x1b0
[   96.397133] Disabling lock debugging due to kernel taint
[  112.507052] Adding 36700156k swap on /dev/ram0.  Priority:-2 extents:1 across:36700156k SS
[  113.131515] ------------[ cut here ]------------
[  113.132190] UBSAN: array-index-out-of-bounds in mm/vmscan.c:1654:14
[  113.132892] index 7 is out of range for type 'long unsigned int [5]'
[  113.133617] CPU: 9 PID: 528 Comm: kswapd0 Tainted: G    B              6.8.0-rc5-ryarob01-swap-out-v4 #2
[  113.134705] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  113.135500] Call trace:
[  113.135776]  dump_backtrace+0x9c/0x128
[  113.136218]  show_stack+0x20/0x38
[  113.136574]  dump_stack_lvl+0x78/0xc8
[  113.136964]  dump_stack+0x18/0x28
[  113.137322]  __ubsan_handle_out_of_bounds+0xa0/0xd8
[  113.137885]  isolate_lru_folios+0x57c/0x658
[  113.138352]  shrink_lruvec+0x5b4/0xdf8
[  113.138751]  shrink_node+0x3f0/0x990
[  113.139152]  balance_pgdat+0x3d0/0x810
[  113.139579]  kswapd+0x268/0x568
[  113.139936]  kthread+0x118/0x128
[  113.140289]  ret_from_fork+0x10/0x20
[  113.140686] ---[ end trace ]---

The UBSAN issue reported for mm/vmscan.c:1654 is:

nr_skipped[folio_zonenum(folio)] += nr_pages;

nr_skipped is a stack array of 5 elements. So I guess folio_zonenum(folio) is returning 7. That comes from the flags. I guess this is most likely just a side effect of the corrupted folio due to someone writing to it while it's on the free list?
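
For reference, folio_zonenum() just decodes the zone from the flags word, roughly:

static inline enum zone_type folio_zonenum(const struct folio *folio)
{
        /*
         * The zone number lives in the top bits of folio->flags, so a
         * corrupted or recycled flags word can decode to a nonsense zone
         * like 7.
         */
        return (folio->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}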




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 16:31           ` Ryan Roberts
@ 2024-03-10 19:57             ` Matthew Wilcox
  2024-03-10 19:59             ` Ryan Roberts
  1 sibling, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10 19:57 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Sun, Mar 10, 2024 at 04:31:25PM +0000, Ryan Roberts wrote:
> That's exactly how I discovered the original problem, and was hoping
> that with your fix, this would unblock me. Given I can only repro this
> when my changes are on top, I guess my code is most likely buggy,
> but perhaps you can take a quick look at the oops and tell me what
> you think?

Well, now my code isn't implicated, I have no interest in helping you.

Just kidding ;-)

> [   96.372503] BUG: Bad page state in process usemem  pfn:be502
> [   96.373336] page: refcount:0 mapcount:0 mapping:000000005abfa8d5 index:0x0 pfn:0xbe502
> [   96.374341] aops:0x0 ino:fffffc0001f940c8
> [   96.374893] flags: 0x7fff8000000000(node=0|zone=0|lastcpupid=0xffff)
> [   96.375653] page_type: 0xffffffff()
> [   96.376071] raw: 007fff8000000000 0000000000000000 fffffc0001f94090 ffff0000c99ee860
> [   96.377055] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> [   96.378650] page dumped because: non-NULL mapping

OK, so page->mapping is ffff0000c99ee860 which does look plausible.
At least it's not a deferred_list (although it is a pfn suitable for
having a deferred_list ... for any allocation up to order-9)

> [   96.390688]  dump_stack_lvl+0x78/0xc8
> [   96.391163]  dump_stack+0x18/0x28
> [   96.391545]  bad_page+0x88/0x128
> [   96.391893]  get_page_from_freelist+0xa94/0x1bc0
> [   96.392407]  __alloc_pages+0x194/0x10b0


> [  113.131515] ------------[ cut here ]------------
> [  113.132190] UBSAN: array-index-out-of-bounds in mm/vmscan.c:1654:14
> [  113.132892] index 7 is out of range for type 'long unsigned int [5]'
> [  113.133617] CPU: 9 PID: 528 Comm: kswapd0 Tainted: G    B              6.8.0-rc5-ryarob01-swap-out-v4 #2
> [  113.134705] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [  113.135500] Call trace:
> [  113.135776]  dump_backtrace+0x9c/0x128
> [  113.136218]  show_stack+0x20/0x38
> [  113.136574]  dump_stack_lvl+0x78/0xc8
> [  113.136964]  dump_stack+0x18/0x28
> [  113.137322]  __ubsan_handle_out_of_bounds+0xa0/0xd8
> [  113.137885]  isolate_lru_folios+0x57c/0x658

I wish it weren't UBSAN reporting this, then we could get the folio
dumped.  I suppose we could put in an explicit check for folio_zonenum()
being > 5.  Does it usually happen in isolate_lru_folios()?

> nr_skipped is a stack array of 5 elements. So I guess folio_zonemem(folio) is returning 7. That comes from the flags. I guess this is most likely just a side effect of the corrupted folio due to someone writing to it while its on the free list?

Or it's a pointer to something that's not a folio?  Are we taking the
wrong lock somewhere again?
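
Something like the below around the nr_skipped update would at least get us a
dump_page() instead of the UBSAN report (untested, just sketching the kind of
check I mean):

                if (WARN_ON_ONCE(folio_zonenum(folio) >= MAX_NR_ZONES)) {
                        dump_page(&folio->page, "bogus zone number");
                        continue;
                }
                nr_skipped[folio_zonenum(folio)] += nr_pages;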



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 16:31           ` Ryan Roberts
  2024-03-10 19:57             ` Matthew Wilcox
@ 2024-03-10 19:59             ` Ryan Roberts
  2024-03-10 20:46               ` Matthew Wilcox
  1 sibling, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-10 19:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 10/03/2024 16:31, Ryan Roberts wrote:
> On 10/03/2024 11:11, Matthew Wilcox wrote:
>> On Sun, Mar 10, 2024 at 11:01:06AM +0000, Ryan Roberts wrote:
>>>> So after my patch, instead of calling (in order):
>>>>
>>>> 	page_cache_release(folio);
>>>> 	folio_undo_large_rmappable(folio);
>>>> 	mem_cgroup_uncharge(folio);
>>>> 	free_unref_page()
>>>>
>>>> it calls:
>>>>
>>>> 	__page_cache_release(folio, &lruvec, &flags);
>>>> 	mem_cgroup_uncharge_folios()
>>>> 	folio_undo_large_rmappable(folio);
>>>
>>> I was just looking at this again, and something pops out...
>>>
>>> You have swapped the order of folio_undo_large_rmappable() and
>>> mem_cgroup_uncharge(). But folio_undo_large_rmappable() calls
>>> get_deferred_split_queue() which tries to get the split queue from
>>> folio_memcg(folio) first and falls back to pgdat otherwise. If you are now
>>> calling mem_cgroup_uncharge_folios() first, will that remove the folio from the
>>> cgroup? Then we are operating on the wrong list? (just a guess based on the name
>>> of the function...)
>>
>> Oh my.  You've got it.  This explains everything.  Thank you!
> 
> I've just taken today's mm-unstable, added your official patch to fix the ordering and applied my large folio swap-out series on top (v4, which I haven't posted yet). In testing that, I'm seeing another oops :-( 
> 
> That's exactly how I discovered the original problem, and was hoping that with your fix, this would unblock me. Given I can only repro this when my changes are on top, I guess my code is most likely buggy, but perhaps you can take a quick look at the oops and tell me what you think?

I've now been able to repro this without any of my code on top - just mm-unstable and your fix for the memcg uncharging ordering issue. So we have a separate, more difficult to repro bug. I've discovered CONFIG_DEBUG_LIST, so I've enabled that. I'll try to bisect in the morning, but I suspect it will be slow going.

[  390.317982] ------------[ cut here ]------------
[  390.318646] list_del corruption. prev->next should be fffffc00152a9090, but was fffffc002798a490. (prev=fffffc002798a490)
[  390.319895] WARNING: CPU: 28 PID: 3187 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xe0/0x110
[  390.320957] Modules linked in:
[  390.321295] CPU: 28 PID: 3187 Comm: usemem Not tainted 6.8.0-rc5-00462-gdbdeae0a47d9 #4
[  390.322432] Hardware name: linux,dummy-virt (DT)
[  390.323078] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  390.324187] pc : __list_del_entry_valid_or_report+0xe0/0x110
[  390.325156] lr : __list_del_entry_valid_or_report+0xe0/0x110
[  390.326179] sp : ffff800087fcb6e0
[  390.326730] x29: ffff800087fcb6e0 x28: 0000fffff7e00000 x27: ffff0005c1c0a790
[  390.327897] x26: ffff00116f44c010 x25: 0000000000000090 x24: 0000000000000001
[  390.329021] x23: ffff800082e2a660 x22: 00000000000000c0 x21: fffffc00152a9090
[  390.330344] x20: ffff0000c7d30818 x19: fffffc00152a9000 x18: 0000000000000006
[  390.331513] x17: 20747562202c3039 x16: 3039613235313030 x15: 6366666666662065
[  390.332607] x14: 6220646c756f6873 x13: 2930393461383937 x12: 3230306366666666
[  390.333713] x11: 663d766572702820 x10: ffff0013f5e7b7c0 x9 : ffff800080128e84
[  390.334945] x8 : 00000000ffffbfff x7 : ffff0013f5e7b7c0 x6 : 80000000ffffc000
[  390.336235] x5 : ffff0013a58ecd08 x4 : 0000000000000000 x3 : ffff8013235c7000
[  390.337435] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00010fe79140
[  390.338501] Call trace:
[  390.338800]  __list_del_entry_valid_or_report+0xe0/0x110
[  390.339704]  folio_undo_large_rmappable+0xb8/0x128
[  390.340572]  folios_put_refs+0x1e4/0x200
[  390.341201]  free_pages_and_swap_cache+0xf0/0x178
[  390.342074]  __tlb_batch_free_encoded_pages+0x54/0xf0
[  390.342898]  tlb_flush_mmu+0x5c/0xe0
[  390.343466]  unmap_page_range+0x960/0xe48
[  390.344112]  unmap_single_vma.constprop.0+0x90/0x118
[  390.344948]  unmap_vmas+0x84/0x180
[  390.345576]  unmap_region+0xdc/0x170
[  390.346208]  do_vmi_align_munmap+0x464/0x5f0
[  390.346988]  do_vmi_munmap+0xb4/0x138
[  390.347657]  __vm_munmap+0xa8/0x188
[  390.348061]  __arm64_sys_munmap+0x28/0x40
[  390.348513]  invoke_syscall+0x50/0x128
[  390.348952]  el0_svc_common.constprop.0+0x48/0xf0
[  390.349494]  do_el0_svc+0x24/0x38
[  390.350085]  el0_svc+0x34/0xb8
[  390.350486]  el0t_64_sync_handler+0x100/0x130
[  390.351256]  el0t_64_sync+0x190/0x198
[  390.351823] ---[ end trace 0000000000000000 ]---


> 
> [   96.372503] BUG: Bad page state in process usemem  pfn:be502
> [   96.373336] page: refcount:0 mapcount:0 mapping:000000005abfa8d5 index:0x0 pfn:0xbe502
> [   96.374341] aops:0x0 ino:fffffc0001f940c8
> [   96.374893] flags: 0x7fff8000000000(node=0|zone=0|lastcpupid=0xffff)
> [   96.375653] page_type: 0xffffffff()
> [   96.376071] raw: 007fff8000000000 0000000000000000 fffffc0001f94090 ffff0000c99ee860
> [   96.377055] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> [   96.378650] page dumped because: non-NULL mapping
> [   96.379828] Modules linked in: binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore ip_tables x_tables autofs4 xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 crct10dif_ce ghash_ce sha2_ce virtio_net sha256_arm64 net_failover sha1_ce virtio_blk failover virtio_scsi virtio_rng aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
> [   96.386802] CPU: 13 PID: 4713 Comm: usemem Not tainted 6.8.0-rc5-ryarob01-swap-out-v4 #2
> [   96.387691] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [   96.388887] Call trace:
> [   96.389348]  dump_backtrace+0x9c/0x128
> [   96.390213]  show_stack+0x20/0x38
> [   96.390688]  dump_stack_lvl+0x78/0xc8
> [   96.391163]  dump_stack+0x18/0x28
> [   96.391545]  bad_page+0x88/0x128
> [   96.391893]  get_page_from_freelist+0xa94/0x1bc0
> [   96.392407]  __alloc_pages+0x194/0x10b0
> [   96.392833]  alloc_pages_mpol+0x98/0x278
> [   96.393278]  vma_alloc_folio+0x74/0xd8
> [   96.393674]  __handle_mm_fault+0x7ac/0x1470
> [   96.394146]  handle_mm_fault+0x70/0x2c8
> [   96.394575]  do_page_fault+0x100/0x530
> [   96.395013]  do_translation_fault+0xa4/0xd0
> [   96.395476]  do_mem_abort+0x4c/0xa8
> [   96.395869]  el0_da+0x30/0xa8
> [   96.396229]  el0t_64_sync_handler+0xb4/0x130
> [   96.396735]  el0t_64_sync+0x1a8/0x1b0
> [   96.397133] Disabling lock debugging due to kernel taint
> [  112.507052] Adding 36700156k swap on /dev/ram0.  Priority:-2 extents:1 across:36700156k SS
> [  113.131515] ------------[ cut here ]------------
> [  113.132190] UBSAN: array-index-out-of-bounds in mm/vmscan.c:1654:14
> [  113.132892] index 7 is out of range for type 'long unsigned int [5]'
> [  113.133617] CPU: 9 PID: 528 Comm: kswapd0 Tainted: G    B              6.8.0-rc5-ryarob01-swap-out-v4 #2
> [  113.134705] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [  113.135500] Call trace:
> [  113.135776]  dump_backtrace+0x9c/0x128
> [  113.136218]  show_stack+0x20/0x38
> [  113.136574]  dump_stack_lvl+0x78/0xc8
> [  113.136964]  dump_stack+0x18/0x28
> [  113.137322]  __ubsan_handle_out_of_bounds+0xa0/0xd8
> [  113.137885]  isolate_lru_folios+0x57c/0x658
> [  113.138352]  shrink_lruvec+0x5b4/0xdf8
> [  113.138751]  shrink_node+0x3f0/0x990
> [  113.139152]  balance_pgdat+0x3d0/0x810
> [  113.139579]  kswapd+0x268/0x568
> [  113.139936]  kthread+0x118/0x128
> [  113.140289]  ret_from_fork+0x10/0x20
> [  113.140686] ---[ end trace ]---
> 
> The UBSAN issue reported for mm/vmscan.c:1654 is:
> 
> nr_skipped[folio_zonenum(folio)] += nr_pages;
> 
> nr_skipped is a stack array of 5 elements. So I guess folio_zonenum(folio) is returning 7. That comes from the flags. I guess this is most likely just a side effect of the corrupted folio due to someone writing to it while it's on the free list?
> 
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 19:59             ` Ryan Roberts
@ 2024-03-10 20:46               ` Matthew Wilcox
  2024-03-10 21:52                 ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10 20:46 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Sun, Mar 10, 2024 at 07:59:46PM +0000, Ryan Roberts wrote:
> I've now been able to repro this without any of my code on top - just mm-unstable and your fix for the memcg uncharging ordering issue. So we have a separate, more difficult to repro bug. I've discovered CONFIG_DEBUG_LIST, so I've enabled that. I'll try to bisect in the morning, but I suspect it will be slow going.
> 
> [  390.317982] ------------[ cut here ]------------
> [  390.318646] list_del corruption. prev->next should be fffffc00152a9090, but was fffffc002798a490. (prev=fffffc002798a490)

Interesting.  So prev->next is pointing to prev, ie prev is an empty
list, but it should be pointing to this entry ... this is feeling like
another missing lock.
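
(For reference, the CONFIG_DEBUG_LIST check that fired is essentially this one
in lib/list_debug.c, sketched from memory:)

bool __list_del_entry_valid_or_report(struct list_head *entry)
{
        struct list_head *prev = entry->prev;
        struct list_head *next = entry->next;

        /* poison checks elided */

        /*
         * This is the message we hit: prev->next == prev means prev has
         * already been reinitialised as an empty list even though this
         * entry still thinks it is linked after it.
         */
        if (CHECK_DATA_CORRUPTION(prev->next != entry,
                        "list_del corruption. prev->next should be %px, but was %px. (prev=%px)\n",
                        entry, prev->next, prev))
                return false;
        if (CHECK_DATA_CORRUPTION(next->prev != entry,
                        "list_del corruption. next->prev should be %px, but was %px. (next=%px)\n",
                        entry, next->prev, next))
                return false;

        return true;
}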



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 20:46               ` Matthew Wilcox
@ 2024-03-10 21:52                 ` Matthew Wilcox
  2024-03-11  9:01                   ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-10 21:52 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Sun, Mar 10, 2024 at 08:46:58PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 10, 2024 at 07:59:46PM +0000, Ryan Roberts wrote:
> > I've now been able to repro this without any of my code on top - just mm-unstable and your fix for the memcg uncharging ordering issue. So we have a separate, more difficult to repro bug. I've discovered CONFIG_DEBUG_LIST, so I've enabled that. I'll try to bisect in the morning, but I suspect it will be slow going.
> > 
> > [  390.317982] ------------[ cut here ]------------
> > [  390.318646] list_del corruption. prev->next should be fffffc00152a9090, but was fffffc002798a490. (prev=fffffc002798a490)
> 
> Interesting.  So prev->next is pointing to prev, ie prev is an empty
> list, but it should be pointing to this entry ... this is feeling like
> another missing lock.

Let's check that we're not inverting the order of memcg_uncharge and
removing a folio from the deferred list (build tested only, but only
one line of this will be new to you):

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb57b3d0c8cd..61fd1a4b424d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
 	if (!folio || !folio_test_large(folio))
 		return;
-	if (folio_order(folio) > 1)
-		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 79d0848c10a5..690c68c18c23 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -525,6 +525,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+	if (order > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 138bcfa18234..e2334c4ee550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
+			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bdff5c0a7c76..1c1925b92934 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1006,10 +1006,11 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		}
 		break;
 	case 2:
-		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
-		 */
+		/* the second tail page: deferred_list overlaps ->mapping */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-10 21:52                 ` Matthew Wilcox
@ 2024-03-11  9:01                   ` Ryan Roberts
  2024-03-11 12:26                     ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-11  9:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 10/03/2024 21:52, Matthew Wilcox wrote:
> On Sun, Mar 10, 2024 at 08:46:58PM +0000, Matthew Wilcox wrote:
>> On Sun, Mar 10, 2024 at 07:59:46PM +0000, Ryan Roberts wrote:
>>> I've now been able to repro this without any of my code on top - just mm-unstable and your fix for the memcg uncharging ordering issue. So we have a separate, more difficult to repro bug. I've discovered CONFIG_DEBUG_LIST, so I've enabled that. I'll try to bisect in the morning, but I suspect it will be slow going.
>>>
>>> [  390.317982] ------------[ cut here ]------------
>>> [  390.318646] list_del corruption. prev->next should be fffffc00152a9090, but was fffffc002798a490. (prev=fffffc002798a490)
>>
>> Interesting.  So prev->next is pointing to prev, ie prev is an empty
>> list, but it should be pointing to this entry ... this is feeling like
>> another missing lock.
> 
> Let's check that we're not inverting the order of memcg_uncharge and
> removing a folio from the deferred list (build tested only, but only
> one line of this will be new to you):

OK, found it - it's another instance of the same issue...

Applied your patch below (resulting code: mm-unstable (d7182786dd0a) + yesterday's fix ("mm: Remove folio from deferred split list before uncharging it") + the patch below).

The new check triggered:

[  153.459843] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffffd5fc0 pfn:0x4da690
[  153.460667] head: order:4 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  153.461218] memcg:ffff0000c7fa1000
[  153.461519] anon flags: 0xbfffc00000a0048(uptodate|head|mappedtodisk|swapbacked|node=0|zone=2|lastcpupid=0xffff)
[  153.462678] page_type: 0xffffffff()
[  153.463294] raw: 0bfffc00000a0048 dead000000000100 dead000000000122 ffff0000fbfa29c1
[  153.470267] raw: 0000000ffffd5fc0 0000000000000000 00000000ffffffff ffff0000c7fa1000
[  153.471395] head: 0bfffc00000a0048 dead000000000100 dead000000000122 ffff0000fbfa29c1
[  153.472494] head: 0000000ffffd5fc0 0000000000000000 00000000ffffffff ffff0000c7fa1000
[  153.473357] head: 0bfffc0000020204 fffffc001269a401 dead000000000122 00000000ffffffff
[  153.481663] head: 0000001000000000 0000000000000000 00000000ffffffff 0000000000000000
[  153.482438] page dumped because: VM_BUG_ON_FOLIO(folio_order(folio) > 1 && !list_empty(&folio->_deferred_list))
[  153.483464] ------------[ cut here ]------------
[  153.484000] kernel BUG at mm/memcontrol.c:7486!
[  153.484484] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[  153.485249] Modules linked in:
[  153.485621] CPU: 33 PID: 2146 Comm: usemem Not tainted 6.8.0-rc5-00463-gb5100df1d6f3 #5
[  153.486552] Hardware name: linux,dummy-virt (DT)
[  153.487300] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  153.488363] pc : uncharge_folio+0x1d0/0x2c8
[  153.488922] lr : uncharge_folio+0x1d0/0x2c8
[  153.489384] sp : ffff80008ea0b6d0
[  153.489747] x29: ffff80008ea0b6d0 x28: 0000000000000000 x27: 00000000fffffffe
[  153.490626] x26: dead000000000100 x25: dead000000000122 x24: 0000000000000020
[  153.491435] x23: ffff80008ea0b918 x22: ffff0000c7f88850 x21: ffff0000c7f88800
[  153.492255] x20: ffff80008ea0b730 x19: fffffc001269a400 x18: 0000000000000006
[  153.493087] x17: 212026262031203e x16: 20296f696c6f6628 x15: 0720072007200720
[  153.494175] x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
[  153.495186] x11: 0720072007200720 x10: ffff0013f5e7b7c0 x9 : ffff800080128e84
[  153.496142] x8 : 00000000ffffbfff x7 : ffff0013f5e7b7c0 x6 : 80000000ffffc000
[  153.497050] x5 : ffff0013a5987d08 x4 : 0000000000000000 x3 : 0000000000000000
[  153.498041] x2 : 0000000000000000 x1 : ffff0000cbc2c500 x0 : 0000000000000063
[  153.499149] Call trace:
[  153.499470]  uncharge_folio+0x1d0/0x2c8
[  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
[  153.500795]  move_folios_to_lru+0x5bc/0x5e0
[  153.501275]  shrink_lruvec+0x5f8/0xb30
[  153.501833]  shrink_node+0x4d8/0x8b0
[  153.502227]  do_try_to_free_pages+0xe0/0x5a8
[  153.502835]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[  153.503708]  try_charge_memcg+0x114/0x658
[  153.504344]  __mem_cgroup_charge+0x6c/0xd0
[  153.505007]  __handle_mm_fault+0x42c/0x1640
[  153.505684]  handle_mm_fault+0x70/0x290
[  153.506136]  do_page_fault+0xfc/0x4d8
[  153.506659]  do_translation_fault+0xa4/0xc0
[  153.507140]  do_mem_abort+0x4c/0xa8
[  153.507716]  el0_da+0x2c/0x78
[  153.508169]  el0t_64_sync_handler+0xb8/0x130
[  153.508810]  el0t_64_sync+0x190/0x198
[  153.509410] Code: 910c8021 a9025bf5 a90363f7 97fd7bef (d4210000) 
[  153.510309] ---[ end trace 0000000000000000 ]---
[  153.510974] Kernel panic - not syncing: Oops - BUG: Fatal exception
[  153.511727] SMP: stopping secondary CPUs
[  153.513519] Kernel Offset: disabled
[  153.514090] CPU features: 0x0,00000020,7002014a,2140720b
[  153.514960] Memory Limit: none
[  153.515457] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---


move_folios_to_lru+0x5bc/0x5e0 is:

static unsigned int move_folios_to_lru(struct lruvec *lruvec,
		struct list_head *list)
{
	...

	if (free_folios.nr) {
		spin_unlock_irq(&lruvec->lru_lock);
		mem_cgroup_uncharge_folios(&free_folios);  <<<<<<<<<<< HERE
		free_unref_folios(&free_folios);
		spin_lock_irq(&lruvec->lru_lock);
	}

	return nr_moved;
}

And that code is from your commit 29f3843026cf ("mm: free folios directly in move_folios_to_lru()"), which is another patch in the same series. This suffers from the same problem; uncharge before removing the folio from the deferred list, so using the wrong lock - there are 2 sites in this function that do this.

A quick grep over the entire series has a lot of hits for "uncharge". I wonder if we need a full audit of that series for other places that could potentially be doing the same thing?
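
To make the ordering hazard concrete, here is a standalone model with
made-up types (not kernel code). It only illustrates why uncharging first
makes the later unlink pick the wrong split_queue_lock, on the assumption
that the deferred split queue is chosen via the folio's memcg, which is
how get_deferred_split_queue() picks it when memcg is enabled:

	#include <stdio.h>
	#include <stddef.h>

	struct split_queue { const char *lock_name; };

	static struct split_queue memcg_queue = { "memcg split_queue_lock" };
	static struct split_queue node_queue  = { "pgdat split_queue_lock" };

	struct folio_model {
		void *memcg;			/* cleared by uncharge */
		int   on_deferred_list;
	};

	/* per-memcg queue while charged, per-node queue otherwise */
	static struct split_queue *queue_for(struct folio_model *folio)
	{
		return folio->memcg ? &memcg_queue : &node_queue;
	}

	int main(void)
	{
		struct folio_model folio = { .memcg = (void *)1, .on_deferred_list = 1 };

		printf("correct order unlinks under: %s\n", queue_for(&folio)->lock_name);

		folio.memcg = NULL;		/* uncharge happened first ... */
		if (folio.on_deferred_list)	/* ... but we are still queued */
			printf("buggy order unlinks under: %s\n", queue_for(&folio)->lock_name);
		return 0;
	}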


> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb57b3d0c8cd..61fd1a4b424d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
>  {
>  	if (!folio || !folio_test_large(folio))
>  		return;
> -	if (folio_order(folio) > 1)
> -		INIT_LIST_HEAD(&folio->_deferred_list);
>  	folio_set_large_rmappable(folio);
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 79d0848c10a5..690c68c18c23 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -525,6 +525,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  	atomic_set(&folio->_entire_mapcount, -1);
>  	atomic_set(&folio->_nr_pages_mapped, 0);
>  	atomic_set(&folio->_pincount, 0);
> +	if (order > 1)
> +		INIT_LIST_HEAD(&folio->_deferred_list);
>  }
>  
>  static inline void prep_compound_tail(struct page *head, int tail_idx)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 138bcfa18234..e2334c4ee550 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	struct obj_cgroup *objcg;
>  
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> +			!list_empty(&folio->_deferred_list), folio);
>  
>  	/*
>  	 * Nobody should be changing or seriously looking at
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdff5c0a7c76..1c1925b92934 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1006,10 +1006,11 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		}
>  		break;
>  	case 2:
> -		/*
> -		 * the second tail page: ->mapping is
> -		 * deferred_list.next -- ignore value.
> -		 */
> +		/* the second tail page: deferred_list overlaps ->mapping */
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "on deferred list");
> +			goto out;
> +		}
>  		break;
>  	default:
>  		if (page->mapping != TAIL_MAPPING) {



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11  9:01                   ` Ryan Roberts
@ 2024-03-11 12:26                     ` Matthew Wilcox
  2024-03-11 12:36                       ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-11 12:26 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Mon, Mar 11, 2024 at 09:01:16AM +0000, Ryan Roberts wrote:
> [  153.499149] Call trace:
> [  153.499470]  uncharge_folio+0x1d0/0x2c8
> [  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
> [  153.500795]  move_folios_to_lru+0x5bc/0x5e0
> [  153.501275]  shrink_lruvec+0x5f8/0xb30

> And that code is from your commit 29f3843026cf ("mm: free folios directly in move_folios_to_lru()"), which is another patch in the same series. This suffers from the same problem; uncharge before removing the folio from the deferred list, so using the wrong lock - there are 2 sites in this function that do this.

Two sites, but basically the same thing; one is for "the batch is full"
and the other is "we finished the list".

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0e53999a865..f60c5b3977dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1842,6 +1842,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);
 
+			if (folio_test_large(folio) &&
+			    folio_test_large_rmappable(folio))
+				folio_undo_large_rmappable(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);

> A quick grep over the entire series has a lot of hits for "uncharge". I
> wonder if we need a full audit of that series for other places that
> could potentially be doing the same thing?

I think this assertion will catch all occurrences of the same thing,
as long as people who are testing are testing in a memcg.  My setup
doesn't use a memcg, so I never saw any of this ;-(

If you confirm this fixes it, I'll send two patches; a respin of the patch
I sent on Sunday that calls undo_large_rmappable in this one extra place,
and then a patch to add the assertions.


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11 12:26                     ` Matthew Wilcox
@ 2024-03-11 12:36                       ` Ryan Roberts
  2024-03-11 15:50                         ` Matthew Wilcox
  0 siblings, 1 reply; 73+ messages in thread
From: Ryan Roberts @ 2024-03-11 12:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 11/03/2024 12:26, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 09:01:16AM +0000, Ryan Roberts wrote:
>> [  153.499149] Call trace:
>> [  153.499470]  uncharge_folio+0x1d0/0x2c8
>> [  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
>> [  153.500795]  move_folios_to_lru+0x5bc/0x5e0
>> [  153.501275]  shrink_lruvec+0x5f8/0xb30
> 
>> And that code is from your commit 29f3843026cf ("mm: free folios directly in move_folios_to_lru()"), which is another patch in the same series. This suffers from the same problem; uncharge before removing the folio from the deferred list, so using the wrong lock - there are 2 sites in this function that do this.
> 
> Two sites, but basically the same thing; one is for "the batch is full"
> and the other is "we finished the list".
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a0e53999a865..f60c5b3977dc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1842,6 +1842,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>  		if (unlikely(folio_put_testzero(folio))) {
>  			__folio_clear_lru_flags(folio);
>  
> +			if (folio_test_large(folio) &&
> +			    folio_test_large_rmappable(folio))
> +				folio_undo_large_rmappable(folio);
>  			if (folio_batch_add(&free_folios, folio) == 0) {
>  				spin_unlock_irq(&lruvec->lru_lock);
>  				mem_cgroup_uncharge_folios(&free_folios);
> 
>> A quick grep over the entire series has a lot of hits for "uncharge". I
>> wonder if we need a full audit of that series for other places that
>> could potentially be doing the same thing?
> 
> I think this assertion will catch all occurrences of the same thing,
> as long as people who are testing are testing in a memcg.  My setup
> doesn't use a memcg, so I never saw any of this ;-(
> 
> If you confirm this fixes it, I'll send two patches; a respin of the patch
> I sent on Sunday that calls undo_large_rmappable in this one extra place,
> and then a patch to add the assertions.

Good timing on your response - I've just finished testing! Although my patch
included both the site you have above and another that I fixed up speculatively
in shrink_folio_list() based on reviewing all mem_cgroup_uncharge_folios() call
sites.

I haven't been able to reproduce any issue with this patch (and the extra
asserts) in place. I've run ~50 iterations over ~2 hours. Previous record was
about 30 iterations before catching an oops. Given how difficult it now is to
repro, I can't be sure this has definitely fixed all possible places, but it's
looking positive.


diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0e53999a865..cf7d4cf47f1a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1436,6 +1436,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                 */
                nr_reclaimed += nr_pages;

+               if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+                       folio_undo_large_rmappable(folio);
+
                if (folio_batch_add(&free_folios, folio) == 0) {
                        mem_cgroup_uncharge_folios(&free_folios);
                        try_to_unmap_flush();
@@ -1842,6 +1845,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
                if (unlikely(folio_put_testzero(folio))) {
                        __folio_clear_lru_flags(folio);

+                       if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+                               folio_undo_large_rmappable(folio);
+
                        if (folio_batch_add(&free_folios, folio) == 0) {
                                spin_unlock_irq(&lruvec->lru_lock);
                                mem_cgroup_uncharge_folios(&free_folios);



There is also a call to mem_cgroup_uncharge() in delete_from_lru_cache(), which
I couldn't convince myself was safe. Perhaps you could do a quick audit of the
call sites?

But taking a step back, I wonder whether the charge() and uncharge() functions
should really be checking to see if the folio is on a deferred split list and if
so, then move the folio to the correct list?




^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11 12:36                       ` Ryan Roberts
@ 2024-03-11 15:50                         ` Matthew Wilcox
  2024-03-11 16:14                           ` Ryan Roberts
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-11 15:50 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Mon, Mar 11, 2024 at 12:36:00PM +0000, Ryan Roberts wrote:
> On 11/03/2024 12:26, Matthew Wilcox wrote:
> > On Mon, Mar 11, 2024 at 09:01:16AM +0000, Ryan Roberts wrote:
> >> [  153.499149] Call trace:
> >> [  153.499470]  uncharge_folio+0x1d0/0x2c8
> >> [  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
> >> [  153.500795]  move_folios_to_lru+0x5bc/0x5e0
> >> [  153.501275]  shrink_lruvec+0x5f8/0xb30
> > 
> >> And that code is from your commit 29f3843026cf ("mm: free folios directly in move_folios_to_lru()"), which is another patch in the same series. This suffers from the same problem; uncharge before removing the folio from the deferred list, so using the wrong lock - there are 2 sites in this function that do this.
> > 
> > Two sites, but basically the same thing; one is for "the batch is full"
> > and the other is "we finished the list".
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a0e53999a865..f60c5b3977dc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1842,6 +1842,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> >  		if (unlikely(folio_put_testzero(folio))) {
> >  			__folio_clear_lru_flags(folio);
> >  
> > +			if (folio_test_large(folio) &&
> > +			    folio_test_large_rmappable(folio))
> > +				folio_undo_large_rmappable(folio);
> >  			if (folio_batch_add(&free_folios, folio) == 0) {
> >  				spin_unlock_irq(&lruvec->lru_lock);
> >  				mem_cgroup_uncharge_folios(&free_folios);
> > 
> >> A quick grep over the entire series has a lot of hits for "uncharge". I
> >> wonder if we need a full audit of that series for other places that
> >> could potentially be doing the same thing?
> > 
> > I think this assertion will catch all occurrences of the same thing,
> > as long as people who are testing are testing in a memcg.  My setup
> > doesn't use a memcg, so I never saw any of this ;-(
> > 
> > If you confirm this fixes it, I'll send two patches; a respin of the patch
> > I sent on Sunday that calls undo_large_rmappable in this one extra place,
> > and then a patch to add the assertions.
> 
> Good timing on your response - I've just finished testing! Although my patch
> included both the site you have above and another that I fixed up speculatively
> in shrink_folio_list() based on reviewing all mem_cgroup_uncharge_folios() call
> sites.

I've been running some tests with this:

+++ b/mm/huge_memory.c
@@ -3223,6 +3221,7 @@ void folio_undo_large_rmappable(struct folio *folio)
        struct deferred_split *ds_queue;
        unsigned long flags;
 
+       folio_clear_large_rmappable(folio);
        if (folio_order(folio) <= 1)
                return;
 
+++ b/mm/memcontrol.c
@@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
        struct obj_cgroup *objcg;

        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+       VM_BUG_ON_FOLIO(folio_test_large(folio) &&
+                       folio_test_large_rmappable(folio), folio);

        /*
         * Nobody should be changing or seriously looking at


which I think is too aggressive to be added.  It does "catch" one spot
where we don't call folio_undo_large_rmappable() in
__filemap_add_folio() but since it's the "we allocated a folio, charged
it, but failed to add it to the pagecache" path, there's no way that
it can have been mmaped, so it can't be on the split list.

With this patch, it took about 700 seconds in my xfstests run to find
the one in shrink_folio_list().  It took 3260 seconds to catch the one
in move_folios_to_lru().

The thought occurs that we know these folios are large rmappable (if
they're large).  It's a little late to make that optimisation, but
during the upcoming development cycle, I'm going to remove that
half of the test.  ie I'll make it look like this:

@@ -1433,6 +1433,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                 */
                nr_reclaimed += nr_pages;

+               if (folio_test_large(folio))
+                       folio_undo_large_rmappable(folio);
                if (folio_batch_add(&free_folios, folio) == 0) {
                        mem_cgroup_uncharge_folios(&free_folios);
                        try_to_unmap_flush();

> There is also a call to mem_cgroup_uncharge() in delete_from_lru_cache(), which
> I couldn't convince myself was safe. Perhaps you could do a quick audit of the
> call sites?

That one is only probably OK.  We usually manage to split a folio before
we get to this point, so we should remove it from the deferred list then.
You'd need to buy a lottery ticket if you managed to get hwpoison in a
folio that was deferred split ...

I reviewed all the locations that call mem_cgroup_uncharge:

__filemap_add_folio	Never mmapable
free_huge_folio		hugetlb, never splittable
delete_from_lru_cache	Discussed above
free_zone_device_page	compound pages not yet supported
destroy_large_folio	folio_undo_large_rmappable already called
__folio_put_small	not a large folio

Then all the placs that call mem_cgroup_uncharge_folios:

folios_put_refs		Just fixed
shrink_folio_list	Just fixed
move_folios_to_lru	Just fixed

So I think we're good.

> But taking a step back, I wonder whether the charge() and uncharge() functions
> should really be checking to see if the folio is on a deferred split list and if
> so, then move the folio to the corect list?

I don't think that's the right thing to do.  All of these places which
uncharge a folio are part of the freeing path, so we always want it
removed, not moved to a different deferred list.

But what about mem_cgroup_move_account()?  Looks like that's memcg v1
only?  Should still be fixed though ...


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11 15:50                         ` Matthew Wilcox
@ 2024-03-11 16:14                           ` Ryan Roberts
  2024-03-11 17:49                             ` Matthew Wilcox
  2024-03-11 19:26                             ` Matthew Wilcox
  0 siblings, 2 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-11 16:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 11/03/2024 15:50, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 12:36:00PM +0000, Ryan Roberts wrote:
>> On 11/03/2024 12:26, Matthew Wilcox wrote:
>>> On Mon, Mar 11, 2024 at 09:01:16AM +0000, Ryan Roberts wrote:
>>>> [  153.499149] Call trace:
>>>> [  153.499470]  uncharge_folio+0x1d0/0x2c8
>>>> [  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
>>>> [  153.500795]  move_folios_to_lru+0x5bc/0x5e0
>>>> [  153.501275]  shrink_lruvec+0x5f8/0xb30
>>>
>>>> And that code is from your commit 29f3843026cf ("mm: free folios directly in move_folios_to_lru()"), which is another patch in the same series. This suffers from the same problem; uncharge before removing the folio from the deferred list, so using the wrong lock - there are 2 sites in this function that do this.
>>>
>>> Two sites, but basically the same thing; one is for "the batch is full"
>>> and the other is "we finished the list".
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index a0e53999a865..f60c5b3977dc 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1842,6 +1842,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>>>  		if (unlikely(folio_put_testzero(folio))) {
>>>  			__folio_clear_lru_flags(folio);
>>>  
>>> +			if (folio_test_large(folio) &&
>>> +			    folio_test_large_rmappable(folio))
>>> +				folio_undo_large_rmappable(folio);
>>>  			if (folio_batch_add(&free_folios, folio) == 0) {
>>>  				spin_unlock_irq(&lruvec->lru_lock);
>>>  				mem_cgroup_uncharge_folios(&free_folios);
>>>
>>>> A quick grep over the entire series has a lot of hits for "uncharge". I
>>>> wonder if we need a full audit of that series for other places that
>>>> could potentially be doing the same thing?
>>>
>>> I think this assertion will catch all occurrences of the same thing,
>>> as long as people who are testing are testing in a memcg.  My setup
>>> doesn't use a memcg, so I never saw any of this ;-(
>>>
>>> If you confirm this fixes it, I'll send two patches; a respin of the patch
>>> I sent on Sunday that calls undo_large_rmappable in this one extra place,
>>> and then a patch to add the assertions.
>>
>> Good timing on your response - I've just finished testing! Although my patch
>> included both the site you have above and another that I fixed up speculatively
>> in shrink_folio_list() based on reviewing all mem_cgroup_uncharge_folios() call
>> sites.
> 
> I've been running some tests with this:
> 
> +++ b/mm/huge_memory.c
> @@ -3223,6 +3221,7 @@ void folio_undo_large_rmappable(struct folio *folio)
>         struct deferred_split *ds_queue;
>         unsigned long flags;
>  
> +       folio_clear_large_rmappable(folio);
>         if (folio_order(folio) <= 1)
>                 return;
>  
> +++ b/mm/memcontrol.c
> @@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>         struct obj_cgroup *objcg;
> 
>         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> +       VM_BUG_ON_FOLIO(folio_test_large(folio) &&
> +                       folio_test_large_rmappable(folio), folio);
> 
>         /*
>          * Nobody should be changing or seriously looking at
> 
> 
> which I think is too aggressive to be added.  It does "catch" one spot
> where we don't call folio_undo_large_rmappable() in
> __filemap_add_folio() but since it's the "we allocated a folio, charged
> it, but failed to add it to the pagecache" path, there's no way that
> it can have been mmaped, so it can't be on the split list.
> 
> With this patch, it took about 700 seconds in my xfstests run to find
> the one in shrink_folio_list().  It took 3260 seconds to catch the one
> in move_folios_to_lru().

54 hours?? That's before I even reported that we still had a bug! Either you're
anticipating my every move, or you have a lot of stuff running in parallel :)

> 
> The thought occurs that we know these folios are large rmappable (if
> they're large).  It's a little late to make that optimisation, but
> during the upcoming development cycle, I'm going to remove that
> half of the test.  ie I'll make it look like this:
> 
> @@ -1433,6 +1433,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                  */
>                 nr_reclaimed += nr_pages;
> 
> +               if (folio_test_large(folio))
> +                       folio_undo_large_rmappable(folio);
>                 if (folio_batch_add(&free_folios, folio) == 0) {
>                         mem_cgroup_uncharge_folios(&free_folios);
>                         try_to_unmap_flush();
> 
>> There is also a call to mem_cgroup_uncharge() in delete_from_lru_cache(), which
>> I couldn't convince myself was safe. Perhaps you could do a quick audit of the
>> call sites?
> 
> That one is only probably OK.  We usually manage to split a folio before
> we get to this point, so we should remove it from the deferred list then.
> You'd need to buy a lottery ticket if you managed to get hwpoison in a
> folio that was deferred split ...

OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
safe and call folio_undo_large_rmappable()?

> 
> I reviewed all the locations that call mem_cgroup_uncharge:
> 
> __filemap_add_folio	Never mmapable
> free_huge_folio		hugetlb, never splittable
> delete_from_lru_cache	Discussed above
> free_zone_device_page	compound pages not yet supported
> destroy_large_folio	folio_undo_large_rmappable already called
> __folio_put_small	not a large folio
> 
> Then all the placs that call mem_cgroup_uncharge_folios:
> 
> folios_put_refs		Just fixed
> shrink_folio_list	Just fixed
> move_folios_to_lru	Just fixed

OK same conclusion as me, except delete_from_lru_cache().

> 
> So I think we're good.
> 
>> But taking a step back, I wonder whether the charge() and uncharge() functions
>> should really be checking to see if the folio is on a deferred split list and if
>> so, then move the folio to the correct list?
> 
> I don't think that's the right thing to do.  All of these places which
> uncharge a folio are part of the freeing path, so we always want it
> removed, not moved to a different deferred list.

Well I'm just thinking about trying to be robust. Clearly you would prefer that
folio_undo_large_rmappable() has been called before uncharge(), then uncharge()
notices that there is nothing on the deferred list (and doesn't take the lock).
But if it's not, is it better to randomly crash (costing best part of a week to
debug) or move the folio to the right list?

Alternatively, can we refactor so that there aren't 9 separate uncharge() call
sites. Those sites are all trying to free the folio so is there a way to better
refactor that into a single place (I guess the argument for the current
arrangement is reducing the number of times that we have to iterate through the
batch?). Then we only have to get it right once.
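
Purely as a sketch of that idea (made-up helper name, no claim about
where it would live, and glossing over the per-caller steps like
try_to_unmap_flush() or dropping lru_lock), something like:

	static void free_folio_batch(struct folio_batch *fbatch)
	{
		unsigned int i;

		for (i = 0; i < folio_batch_count(fbatch); i++) {
			struct folio *folio = fbatch->folios[i];

			/* take the folio off its deferred split list
			 * before it is uncharged */
			if (folio_test_large(folio) &&
			    folio_test_large_rmappable(folio))
				folio_undo_large_rmappable(folio);
		}
		mem_cgroup_uncharge_folios(fbatch);
		free_unref_folios(fbatch);
	}

so each call site only has to fill the batch and call one thing; the
per-caller differences are presumably why the sites are open-coded today.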

> 
> But what about mem_cgroup_move_account()?  Looks like that's memcg v1
> only?  Should still be fixed though ...

Right.

And what about the first bug you found with the local list corruption? I'm not
running with that fix so it's obviously not a problem here. But I still think it's
a bug that we should fix? list_for_each_entry_safe() isn't safe against
*concurrent* list modification, right?



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11 16:14                           ` Ryan Roberts
@ 2024-03-11 17:49                             ` Matthew Wilcox
  2024-03-12 11:57                               ` Ryan Roberts
  2024-03-11 19:26                             ` Matthew Wilcox
  1 sibling, 1 reply; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-11 17:49 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
> > With this patch, it took about 700 seconds in my xfstests run to find
> > the one in shrink_folio_list().  It took 3260 seconds to catch the one
> > in move_folios_to_lru().
> 
> 54 hours?? That's before I even reported that we still had a bug! Either you're
> anticipating my every move, or you have a lot of stuff running in parallel :)

One hour is 3600 seconds ;-)  So more like 53 minutes.

> > That one is only probably OK.  We usually manage to split a folio before
> > we get to this point, so we should remove it from the deferred list then.
> > You'd need to buy a lottery ticket if you managed to get hwpoison in a
> > folio that was deferred split ...
> 
> OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
> safe and call folio_undo_large_rmappable()?

Yes, I've added it; just commenting on the likelihood of being able to
hit it in testing, even with aggressive error injection.

> >> But taking a step back, I wonder whether the charge() and uncharge() functions
> >> should really be checking to see if the folio is on a deferred split list and if
> >> so, then move the folio to the correct list?
> > 
> > I don't think that's the right thing to do.  All of these places which
> > uncharge a folio are part of the freeing path, so we always want it
> > removed, not moved to a different deferred list.
> 
> Well I'm just thinking about trying to be robust. Clearly you would prefer that
> folio_undo_large_rmappable() has been called before uncharge(), then uncharge()
> notices that there is nothing on the deferred list (and doesn't take the lock).
> But if it's not, is it better to randomly crash (costing best part of a week to
> debug) or move the folio to the right list?

Neither ;-)  The right option is to include the assertion that the
deferred list is empty.  That way we get to see the backtrace of whoever
forgot to take the folio off the deferred list.

> Alternatively, can we refactor so that there aren't 9 separate uncharge() call
> sites. Those sites are all trying to free the folio so is there a way to better
> refactor that into a single place (I guess the argument for the current
> arrangement is reducing the number of times that we have to iterate through the
> batch?). Then we only have to get it right once.

I have been wondering about a better way to do it.  I've also been
looking a bit askance at put_pages_list() which doesn't do memcg
uncharging ...

> > 
> > But what about mem_cgroup_move_account()?  Looks like that's memcg v1
> > only?  Should still be fixed though ...
> 
> Right.
> 
> And what about the first bug you found with the local list corruption? I'm not
> running with that fix so it's obviously not a problem here. But I still think it's
> a bug that we should fix? list_for_each_entry_safe() isn't safe against
> *concurrent* list modification, right?

I've been thinking about that too.  I decided that the local list is
actually protected by the lock after all.  It's a bit fiddly to prove,
but:

1. We have a reference on every folio ahead on the list (not behind us,
but see below)
2. If split_folio succeeds, it takes the lock that would protect the
list we are on.
3. If it doesn't, and folio_put() turns out to be the last reference,
__folio_put_large -> destroy_large_folio -> folio_undo_large_rmappable
takes the lock that protects the list we would be on.

So we can analyse this loop as:

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (random() & 1)
			continue;
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		list_del_init(&folio->_deferred_list);
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}

We're guaranteed that 'next' is a valid folio because we hold a refcount
on it.  Anything left on the list between &list and next may have been
removed from the list, but we don't look at those entries until after
we take the split_queue_lock again to do the list_splice_tail().

I'm too scared to write a loop like this, but I don't think it contains
a bug.
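
A standalone illustration of why the pattern tolerates deleting the entry
being visited (minimal list code, not the kernel's, and it says nothing
about entries further ahead - that is where the refcount argument above
comes in):

	#include <stdio.h>

	struct node { struct node *next, *prev; int id; };

	static void list_del_init(struct node *n)
	{
		n->prev->next = n->next;
		n->next->prev = n->prev;
		n->next = n->prev = n;		/* points at itself, like the kernel's */
	}

	int main(void)
	{
		struct node head, a = { .id = 1 }, b = { .id = 2 };
		struct node *pos, *next;

		head.next = &a; a.prev = &head;
		a.next = &b;    b.prev = &a;
		b.next = &head; head.prev = &b;

		/* hand-expanded equivalent of list_for_each_entry_safe() */
		for (pos = head.next, next = pos->next; pos != &head;
		     pos = next, next = next->next) {
			list_del_init(pos);	/* safe: 'next' was sampled first */
			printf("visited %d\n", pos->id);
		}
		return 0;
	}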


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11 16:14                           ` Ryan Roberts
  2024-03-11 17:49                             ` Matthew Wilcox
@ 2024-03-11 19:26                             ` Matthew Wilcox
  1 sibling, 0 replies; 73+ messages in thread
From: Matthew Wilcox @ 2024-03-11 19:26 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Andrew Morton, linux-mm

On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
> >> There is also a call to mem_cgroup_uncharge() in delete_from_lru_cache(), which
> >> I couldn't convince myself was safe. Perhaps you could do a quick audit of the
> >> call sites?
> > 
> > That one is only probably OK.  We usually manage to split a folio before
> > we get to this point, so we should remove it from the deferred list then.
> > You'd need to buy a lottery ticket if you managed to get hwpoison in a
> > folio that was deferred split ...
> 
> OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
> safe and call folio_undo_large_rmappable()?

Actually, it certainly can't be hit.  Here's the code path:

mem_cgroup_uncharge() is only called from delete_from_lru_cache()

delete_from_lru_cache() is called from me_pagecache_clean(),
me_swapcache_dirty() and me_swapcache_clean() [1]

Those are all called through the error_states dispatch table.
which means they're all called through identify_page_state()

identify_page_state() (other than for hugetlb) is called only from
memory_failure()

memory_failure() calls try_to_split_thp_page() and only calls
identify_page_state() if it succeeds.  ie we cannot be dealing with a
large folio at this point, so we don't need to worry about the deferred
split list.

[1] me not cookie monster me is memory error



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
  2024-03-11 17:49                             ` Matthew Wilcox
@ 2024-03-12 11:57                               ` Ryan Roberts
  0 siblings, 0 replies; 73+ messages in thread
From: Ryan Roberts @ 2024-03-12 11:57 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm

On 11/03/2024 17:49, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
>>> With this patch, it took about 700 seconds in my xfstests run to find
>>> the one in shrink_folio_list().  It took 3260 seconds to catch the one
>>> in move_folios_to_lru().
>>
>> 54 hours?? That's before I even reported that we still had a bug! Either you're
>> anticipating my every move, or you have a lot of stuff running in parallel :)
> 
> One hour is 3600 seconds ;-)  So more like 53 minutes.
> 
>>> That one is only probably OK.  We usually manage to split a folio before
>>> we get to this point, so we should remove it from the deferred list then.
>>> You'd need to buy a lottery ticket if you managed to get hwpoison in a
>>> folio that was deferred split ...
>>
>> OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
>> safe and call folio_undo_large_rmappable()?
> 
> Yes, I've added it; just commenting on the likelihood of being able to
> hit it in testing, even with aggressive error injection.
> 
>>>> But taking a step back, I wonder whether the charge() and uncharge() functions
>>>> should really be checking to see if the folio is on a deferred split list and if
>>>> so, then move the folio to the correct list?
>>>
>>> I don't think that's the right thing to do.  All of these places which
>>> uncharge a folio are part of the freeing path, so we always want it
>>> removed, not moved to a different deferred list.
>>
>> Well I'm just thinking about trying to be robust. Clearly you would prefer that
>> folio_undo_large_rmappable() has been called before uncharge(), then uncharge()
>> notices that there is nothing on the deferred list (and doesn't take the lock).
>> But if it's not, is it better to randomly crash (costing best part of a week to
>> debug) or move the folio to the right list?
> 
> Neither ;-)  The right option is to include the assertion that the
> deferred list is empty.  That way we get to see the backtrace of whoever
> forgot to take the folio off the deferred list.
> 
>> Alternatively, can we refactor so that there aren't 9 separate uncharge() call
>> sites. Those sites are all trying to free the folio so is there a way to better
>> refactor that into a single place (I guess the argument for the current
>> arrangement is reducing the number of times that we have to iterate through the
>> batch?). Then we only have to get it right once.
> 
> I have been wondering about a better way to do it.  I've also been
> looking a bit askance at put_pages_list() which doesn't do memcg
> uncharging ...
> 
>>>
>>> But what about mem_cgroup_move_account()?  Looks like that's memcg v1
>>> only?  Should still be fixed though ...
>>
>> Right.
>>
>> And what about the first bug you found with the local list corruption? I'm not
>> running with that fix so it's obviously not a problem here. But I still think it's
>> a bug that we should fix? list_for_each_entry_safe() isn't safe against
>> *concurrent* list modification, right?
> 
> I've been thinking about that too.  I decided that the local list is
> actually protected by the lock after all.  It's a bit fiddly to prove,
> but:
> 
> 1. We have a reference on every folio ahead on the list (not behind us,
> but see below)
> 2. If split_folio succeeds, it takes the lock that would protect the
> list we are on.
> 3. If it doesn't, and folio_put() turns out to be the last reference,
> __folio_put_large -> destroy_large_folio -> folio_undo_large_rmappable
> takes the lock that protects the list we would be on.
> 
> So we can analyse this loop as:
> 
> 	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> 		if (random() & 1)
> 			continue;
> 		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> 		list_del_init(&folio->_deferred_list);
> 		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> 	}
> 
> We're guaranteed that 'next' is a valid folio because we hold a refcount
> on it.  Anything left on the list between &list and next may have been
> removed from the list, but we don't look at those entries until after
> we take the split_queue_lock again to do the list_splice_tail().
> 
> I'm too scared to write a loop like this, but I don't think it contains
> a bug.

OK, wow. Now that I'm looking at the implementation of
list_for_each_entry_safe() along with your reasoning, that is clear. But it's
certainly not obvious looking at deferred_split_scan().



^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2024-03-12 11:57 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 01/18] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 02/18] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 03/18] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 04/18] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 05/18] memcg: Add mem_cgroup_uncharge_folios() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 06/18] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 07/18] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 08/18] mm: use __page_cache_release() in folios_put() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 09/18] mm: Handle large folios in free_unref_folios() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
2024-03-06 13:42   ` Ryan Roberts
2024-03-06 16:09     ` Matthew Wilcox
2024-03-06 16:19       ` Ryan Roberts
2024-03-06 17:41         ` Ryan Roberts
2024-03-06 18:41           ` Zi Yan
2024-03-06 19:55             ` Matthew Wilcox
2024-03-06 21:55               ` Matthew Wilcox
2024-03-07  8:56                 ` Ryan Roberts
2024-03-07 13:50                   ` Yin, Fengwei
2024-03-07 14:05                     ` Re: Matthew Wilcox
2024-03-07 15:24                       ` Re: Ryan Roberts
2024-03-07 16:24                         ` Re: Ryan Roberts
2024-03-07 23:02                           ` Re: Matthew Wilcox
2024-03-08  1:06                       ` Re: Yin, Fengwei
2024-03-07 17:33                   ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox
2024-03-07 18:35                     ` Ryan Roberts
2024-03-07 20:42                       ` Matthew Wilcox
2024-03-08 11:44                     ` Ryan Roberts
2024-03-08 12:09                       ` Ryan Roberts
2024-03-08 14:21                         ` Ryan Roberts
2024-03-08 15:11                           ` Matthew Wilcox
2024-03-08 16:03                             ` Matthew Wilcox
2024-03-08 17:13                               ` Ryan Roberts
2024-03-08 18:09                                 ` Ryan Roberts
2024-03-08 18:18                                   ` Matthew Wilcox
2024-03-09  4:34                                     ` Andrew Morton
2024-03-09  4:52                                       ` Matthew Wilcox
2024-03-09  8:05                                         ` Ryan Roberts
2024-03-09 12:33                                           ` Ryan Roberts
2024-03-10 13:38                                             ` Matthew Wilcox
2024-03-08 15:33                         ` Matthew Wilcox
2024-03-09  6:09                       ` Matthew Wilcox
2024-03-09  7:59                         ` Ryan Roberts
2024-03-09  8:18                           ` Ryan Roberts
2024-03-09  9:38                             ` Ryan Roberts
2024-03-10  4:23                               ` Matthew Wilcox
2024-03-10  8:23                                 ` Ryan Roberts
2024-03-10 11:08                                   ` Matthew Wilcox
2024-03-10 11:01       ` Ryan Roberts
2024-03-10 11:11         ` Matthew Wilcox
2024-03-10 16:31           ` Ryan Roberts
2024-03-10 19:57             ` Matthew Wilcox
2024-03-10 19:59             ` Ryan Roberts
2024-03-10 20:46               ` Matthew Wilcox
2024-03-10 21:52                 ` Matthew Wilcox
2024-03-11  9:01                   ` Ryan Roberts
2024-03-11 12:26                     ` Matthew Wilcox
2024-03-11 12:36                       ` Ryan Roberts
2024-03-11 15:50                         ` Matthew Wilcox
2024-03-11 16:14                           ` Ryan Roberts
2024-03-11 17:49                             ` Matthew Wilcox
2024-03-12 11:57                               ` Ryan Roberts
2024-03-11 19:26                             ` Matthew Wilcox
2024-03-10 11:14         ` Ryan Roberts
2024-02-27 17:42 ` [PATCH v3 11/18] mm: Free folios in a batch in shrink_folio_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 12/18] mm: Free folios directly in move_folios_to_lru() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 13/18] memcg: Remove mem_cgroup_uncharge_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 14/18] mm: Remove free_unref_page_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 15/18] mm: Remove lru_to_page() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 16/18] mm: Convert free_pages_and_swap_cache() to use folios_put() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 17/18] mm: Use a folio in __collapse_huge_page_copy_succeeded() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 18/18] mm: Convert free_swap_cache() to take a folio Matthew Wilcox (Oracle)
