* [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
@ 2022-04-25 14:31 Zi Yan
  2022-04-25 14:31 ` [PATCH v11 1/6] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c Zi Yan
                   ` (7 more replies)
  0 siblings, 8 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi David,

This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
and alloc_contig_range(). It prepares for my upcoming changes to make
MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-20-17-12.

Changelog
===
V11
---
1. Moved the start_isolate_page_range()/undo_isolate_page_range() alignment
   change into a separate patch, placed after the unmovable page check change
   and the alloc_contig_range() change, to avoid some unwanted memory
   hotplug/hotremove failures.
2. Cleaned up has_unmovable_pages() in Patch 2.

V10
---
1. Reverted to the original outer_start, outer_end range for
   test_pages_isolated() and isolate_freepages_range() in Patch 3;
   otherwise isolation will fail if start in alloc_contig_range() is in
   the middle of a free page.

V9
---
1. Limited the has_unmovable_pages() check to a single pageblock.
2. Added a check to ensure page isolation is done within a single zone
   in isolate_single_pageblock().
3. Fixed an off-by-one bug in isolate_single_pageblock().
4. Fixed a NULL-dereferencing bug in isolate_single_pageblock() when the pages
   before the to-be-isolated pageblock are not online.

V8
---
1. Cleaned up has_unmovable_pages() to remove page argument.

V7
---
1. Added a page validity check in isolate_single_pageblock() to avoid
   out-of-zone pages.
2. Fixed a bug in split_free_page() to split and free pages in the correct
   page order.

V6
---
1. Resolved a compilation error/warning reported by the kernel test robot.
2. Tried to address the coding concerns from Christophe Leroy.
3. Shortened lengthy lines (pointed out by Christoph Hellwig).

V5
---
1. Moved isolation address alignment handling into start_isolate_page_range().
2. Rewrote and simplified how alloc_contig_range() works at pageblock
   granularity (Patch 3). Only two pageblock migratetypes need to be saved and
   restored. start_isolate_page_range() might need to migrate pages in this
   version, but it spares the caller from worrying about
   max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) alignment after the page range
   is isolated.

V4
---
1. Dropped two irrelevant patches on non-lru compound page handling, as
   it is not supported upstream.
2. Renamed migratetype_has_fallback() to migratetype_is_mergeable().
3. Always check whether two pageblocks can be merged in
   __free_one_page() when order is >= pageblock_order, as the case of
   non-mergeable pageblocks (isolated, CMA, and HIGHATOMIC) becomes more common.
4. Moving has_unmovable_pages() is now a separate patch.
5. Removed the MAX_ORDER-1 alignment requirement in the comment in the
   virtio_mem code.

Description
===

The MAX_ORDER - 1 alignment requirement exists because alloc_contig_range()
isolates pageblocks to remove free memory from the buddy allocator, but
isolating only a subset of the pageblocks within a page that spans multiple
pageblocks causes free page accounting issues. An isolated page might not be
put onto the right free list, since the code assumes the migratetype of the
first pageblock applies to the whole free page. This is based on the
discussion at [2].

To remove the requirement, this patchset:
1. isolates pages at pageblock granularity instead of
   max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) (see the sketch after this list);
2. splits free pages across the specified range, or migrates in-use pages
   across the specified range and then splits the freed page, to avoid free
   page accounting issues (this happens when multiple pageblocks within a
   single page have different migratetypes);
3. only checks unmovable pages within the range instead of the MAX_ORDER - 1
   aligned range during isolation, to avoid alloc_contig_range() failure when
   pageblocks within a MAX_ORDER - 1 aligned range are allocated separately;
4. returns pages not in the range as it did before.
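
As an illustration (a minimal sketch, not taken from the patchset; base_pfn
and the GFP mask are made-up placeholders), a CONFIG_CONTIG_ALLOC user now
only has to care about pageblock alignment rather than MAX_ORDER_NR_PAGES
alignment:

	/* Hypothetical caller sketch: pfn values are illustrative only. */
	unsigned long start = ALIGN(base_pfn, pageblock_nr_pages);
	unsigned long nr_pages = 4 * pageblock_nr_pages;
	int ret;

	ret = alloc_contig_range(start, start + nr_pages,
				 MIGRATE_MOVABLE, GFP_KERNEL);
	if (!ret)		/* got [start, start + nr_pages) */
		free_contig_range(start, nr_pages);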

One optimization might come later:
1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
   migratetypes when isolation fails in the middle of the range.

Feel free to give comments and suggestions. Thanks.

[1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
[2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/

Zi Yan (6):
  mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
  mm: page_isolation: check specified range for unmovable pages
  mm: make alloc_contig_range work at pageblock granularity
  mm: page_isolation: enable arbitrary range page isolation.
  mm: cma: use pageblock_order as the single alignment
  drivers: virtio_mem: use pageblock size as the minimum virtio_mem
    size.

 drivers/virtio/virtio_mem.c    |   6 +-
 include/linux/cma.h            |   4 +-
 include/linux/mmzone.h         |   5 +-
 include/linux/page-isolation.h |   6 +-
 mm/internal.h                  |   6 +
 mm/memory_hotplug.c            |   3 +-
 mm/page_alloc.c                | 191 +++++-------------
 mm/page_isolation.c            | 345 +++++++++++++++++++++++++++++++--
 8 files changed, 392 insertions(+), 174 deletions(-)

-- 
2.35.1



* [PATCH v11 1/6] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
@ 2022-04-25 14:31 ` Zi Yan
  2022-04-25 14:31 ` [PATCH v11 2/6] mm: page_isolation: check specified range for unmovable pages Zi Yan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan, Mike Rapoport

From: Zi Yan <ziy@nvidia.com>

has_unmovable_pages() is only used in mm/page_isolation.c. Move it from
mm/page_alloc.c and make it static.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-isolation.h |   2 -
 mm/page_alloc.c                | 119 ---------------------------------
 mm/page_isolation.c            | 119 +++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+), 121 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 572458016331..e14eddf6741a 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,8 +33,6 @@ static inline bool is_migrate_isolate(int migratetype)
 #define MEMORY_OFFLINE	0x1
 #define REPORT_FAILURE	0x2
 
-struct page *has_unmovable_pages(struct zone *zone, struct page *page,
-				 int migratetype, int flags);
 void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype, int *num_movable);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77e8d15523d..ce23ac8ad085 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8886,125 +8886,6 @@ void *__init alloc_large_system_hash(const char *tablename,
 	return table;
 }
 
-/*
- * This function checks whether pageblock includes unmovable pages or not.
- *
- * PageLRU check without isolation or lru_lock could race so that
- * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
- * check without lock_page also may miss some movable non-lru pages at
- * race condition. So you can't expect this function should be exact.
- *
- * Returns a page without holding a reference. If the caller wants to
- * dereference that page (e.g., dumping), it has to make sure that it
- * cannot get removed (e.g., via memory unplug) concurrently.
- *
- */
-struct page *has_unmovable_pages(struct zone *zone, struct page *page,
-				 int migratetype, int flags)
-{
-	unsigned long iter = 0;
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long offset = pfn % pageblock_nr_pages;
-
-	if (is_migrate_cma_page(page)) {
-		/*
-		 * CMA allocations (alloc_contig_range) really need to mark
-		 * isolate CMA pageblocks even when they are not movable in fact
-		 * so consider them movable here.
-		 */
-		if (is_migrate_cma(migratetype))
-			return NULL;
-
-		return page;
-	}
-
-	for (; iter < pageblock_nr_pages - offset; iter++) {
-		page = pfn_to_page(pfn + iter);
-
-		/*
-		 * Both, bootmem allocations and memory holes are marked
-		 * PG_reserved and are unmovable. We can even have unmovable
-		 * allocations inside ZONE_MOVABLE, for example when
-		 * specifying "movablecore".
-		 */
-		if (PageReserved(page))
-			return page;
-
-		/*
-		 * If the zone is movable and we have ruled out all reserved
-		 * pages then it should be reasonably safe to assume the rest
-		 * is movable.
-		 */
-		if (zone_idx(zone) == ZONE_MOVABLE)
-			continue;
-
-		/*
-		 * Hugepages are not in LRU lists, but they're movable.
-		 * THPs are on the LRU, but need to be counted as #small pages.
-		 * We need not scan over tail pages because we don't
-		 * handle each tail page individually in migration.
-		 */
-		if (PageHuge(page) || PageTransCompound(page)) {
-			struct page *head = compound_head(page);
-			unsigned int skip_pages;
-
-			if (PageHuge(page)) {
-				if (!hugepage_migration_supported(page_hstate(head)))
-					return page;
-			} else if (!PageLRU(head) && !__PageMovable(head)) {
-				return page;
-			}
-
-			skip_pages = compound_nr(head) - (page - head);
-			iter += skip_pages - 1;
-			continue;
-		}
-
-		/*
-		 * We can't use page_count without pin a page
-		 * because another CPU can free compound page.
-		 * This check already skips compound tails of THP
-		 * because their page->_refcount is zero at all time.
-		 */
-		if (!page_ref_count(page)) {
-			if (PageBuddy(page))
-				iter += (1 << buddy_order(page)) - 1;
-			continue;
-		}
-
-		/*
-		 * The HWPoisoned page may be not in buddy system, and
-		 * page_count() is not 0.
-		 */
-		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
-			continue;
-
-		/*
-		 * We treat all PageOffline() pages as movable when offlining
-		 * to give drivers a chance to decrement their reference count
-		 * in MEM_GOING_OFFLINE in order to indicate that these pages
-		 * can be offlined as there are no direct references anymore.
-		 * For actually unmovable PageOffline() where the driver does
-		 * not support this, we will fail later when trying to actually
-		 * move these pages that still have a reference count > 0.
-		 * (false negatives in this function only)
-		 */
-		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
-			continue;
-
-		if (__PageMovable(page) || PageLRU(page))
-			continue;
-
-		/*
-		 * If there are RECLAIMABLE pages, we need to check
-		 * it.  But now, memory offline itself doesn't call
-		 * shrink_node_slabs() and it still to be fixed.
-		 */
-		return page;
-	}
-	return NULL;
-}
-
 #ifdef CONFIG_CONTIG_ALLOC
 static unsigned long pfn_max_align_down(unsigned long pfn)
 {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ff0ea6308299..df49f86a6ed1 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -15,6 +15,125 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/page_isolation.h>
 
+/*
+ * This function checks whether pageblock includes unmovable pages or not.
+ *
+ * PageLRU check without isolation or lru_lock could race so that
+ * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
+ * check without lock_page also may miss some movable non-lru pages at
+ * race condition. So you can't expect this function should be exact.
+ *
+ * Returns a page without holding a reference. If the caller wants to
+ * dereference that page (e.g., dumping), it has to make sure that it
+ * cannot get removed (e.g., via memory unplug) concurrently.
+ *
+ */
+static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
+				 int migratetype, int flags)
+{
+	unsigned long iter = 0;
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset = pfn % pageblock_nr_pages;
+
+	if (is_migrate_cma_page(page)) {
+		/*
+		 * CMA allocations (alloc_contig_range) really need to mark
+		 * isolate CMA pageblocks even when they are not movable in fact
+		 * so consider them movable here.
+		 */
+		if (is_migrate_cma(migratetype))
+			return NULL;
+
+		return page;
+	}
+
+	for (; iter < pageblock_nr_pages - offset; iter++) {
+		page = pfn_to_page(pfn + iter);
+
+		/*
+		 * Both, bootmem allocations and memory holes are marked
+		 * PG_reserved and are unmovable. We can even have unmovable
+		 * allocations inside ZONE_MOVABLE, for example when
+		 * specifying "movablecore".
+		 */
+		if (PageReserved(page))
+			return page;
+
+		/*
+		 * If the zone is movable and we have ruled out all reserved
+		 * pages then it should be reasonably safe to assume the rest
+		 * is movable.
+		 */
+		if (zone_idx(zone) == ZONE_MOVABLE)
+			continue;
+
+		/*
+		 * Hugepages are not in LRU lists, but they're movable.
+		 * THPs are on the LRU, but need to be counted as #small pages.
+		 * We need not scan over tail pages because we don't
+		 * handle each tail page individually in migration.
+		 */
+		if (PageHuge(page) || PageTransCompound(page)) {
+			struct page *head = compound_head(page);
+			unsigned int skip_pages;
+
+			if (PageHuge(page)) {
+				if (!hugepage_migration_supported(page_hstate(head)))
+					return page;
+			} else if (!PageLRU(head) && !__PageMovable(head)) {
+				return page;
+			}
+
+			skip_pages = compound_nr(head) - (page - head);
+			iter += skip_pages - 1;
+			continue;
+		}
+
+		/*
+		 * We can't use page_count without pin a page
+		 * because another CPU can free compound page.
+		 * This check already skips compound tails of THP
+		 * because their page->_refcount is zero at all time.
+		 */
+		if (!page_ref_count(page)) {
+			if (PageBuddy(page))
+				iter += (1 << buddy_order(page)) - 1;
+			continue;
+		}
+
+		/*
+		 * The HWPoisoned page may be not in buddy system, and
+		 * page_count() is not 0.
+		 */
+		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
+			continue;
+
+		/*
+		 * We treat all PageOffline() pages as movable when offlining
+		 * to give drivers a chance to decrement their reference count
+		 * in MEM_GOING_OFFLINE in order to indicate that these pages
+		 * can be offlined as there are no direct references anymore.
+		 * For actually unmovable PageOffline() where the driver does
+		 * not support this, we will fail later when trying to actually
+		 * move these pages that still have a reference count > 0.
+		 * (false negatives in this function only)
+		 */
+		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
+			continue;
+
+		if (__PageMovable(page) || PageLRU(page))
+			continue;
+
+		/*
+		 * If there are RECLAIMABLE pages, we need to check
+		 * it.  But now, memory offline itself doesn't call
+		 * shrink_node_slabs() and it still to be fixed.
+		 */
+		return page;
+	}
+	return NULL;
+}
+
 static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
 {
 	struct zone *zone = page_zone(page);
-- 
2.35.1



* [PATCH v11 2/6] mm: page_isolation: check specified range for unmovable pages
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
  2022-04-25 14:31 ` [PATCH v11 1/6] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c Zi Yan
@ 2022-04-25 14:31 ` Zi Yan
  2022-04-25 14:31 ` [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity Zi Yan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Enable set_migratetype_isolate() to check the specified range for
unmovable pages during isolation, to prepare for arbitrary range page
isolation. The functionality will take effect in upcoming commits by
adjusting the callers of start_isolate_page_range(), which uses
set_migratetype_isolate().

For example, alloc_contig_range(), which calls start_isolate_page_range(),
accepts unaligned ranges, but because page isolation is currently done at
MAX_ORDER_NR_PAGES granularity, pages that are outside the specified range
but within MAX_ORDER_NR_PAGES alignment might be attempted for isolation,
and the failure to isolate these unrelated pages fails the whole
operation undesirably.
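
The essence of the change (mirroring the hunk below) is to clamp the
unmovable-page scan to the overlap of the pageblock and the caller's range:

	/* Only scan the part of the pageblock the caller actually asked for. */
	check_unmovable_start = max(page_to_pfn(page), start_pfn);
	check_unmovable_end = min(ALIGN(page_to_pfn(page) + 1, pageblock_nr_pages),
				  end_pfn);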

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_isolation.c | 47 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index df49f86a6ed1..c2f7a8bb634d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -16,7 +16,9 @@
 #include <trace/events/page_isolation.h>
 
 /*
- * This function checks whether pageblock includes unmovable pages or not.
+ * This function checks whether the range [start_pfn, end_pfn) includes
+ * unmovable pages or not. The range must fall into a single pageblock and
+ * consequently belong to a single zone.
  *
  * PageLRU check without isolation or lru_lock could race so that
  * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
@@ -28,12 +30,15 @@
  * cannot get removed (e.g., via memory unplug) concurrently.
  *
  */
-static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
-				 int migratetype, int flags)
+static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long end_pfn,
+				int migratetype, int flags)
 {
-	unsigned long iter = 0;
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long offset = pfn % pageblock_nr_pages;
+	struct page *page = pfn_to_page(start_pfn);
+	struct zone *zone = page_zone(page);
+	unsigned long pfn;
+
+	VM_BUG_ON(ALIGN_DOWN(start_pfn, pageblock_nr_pages) !=
+		  ALIGN_DOWN(end_pfn - 1, pageblock_nr_pages));
 
 	if (is_migrate_cma_page(page)) {
 		/*
@@ -47,8 +52,8 @@ static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 		return page;
 	}
 
-	for (; iter < pageblock_nr_pages - offset; iter++) {
-		page = pfn_to_page(pfn + iter);
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		page = pfn_to_page(pfn);
 
 		/*
 		 * Both, bootmem allocations and memory holes are marked
@@ -85,7 +90,7 @@ static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 			}
 
 			skip_pages = compound_nr(head) - (page - head);
-			iter += skip_pages - 1;
+			pfn += skip_pages - 1;
 			continue;
 		}
 
@@ -97,7 +102,7 @@ static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 		 */
 		if (!page_ref_count(page)) {
 			if (PageBuddy(page))
-				iter += (1 << buddy_order(page)) - 1;
+				pfn += (1 << buddy_order(page)) - 1;
 			continue;
 		}
 
@@ -134,11 +139,18 @@ static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 	return NULL;
 }
 
-static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
+/*
+ * This function sets the pageblock migratetype to MIGRATE_ISOLATE if no
+ * unmovable page is present in [start_pfn, end_pfn). The pageblock must
+ * intersect with [start_pfn, end_pfn).
+ */
+static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags,
+			unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct zone *zone = page_zone(page);
 	struct page *unmovable;
 	unsigned long flags;
+	unsigned long check_unmovable_start, check_unmovable_end;
 
 	spin_lock_irqsave(&zone->lock, flags);
 
@@ -155,8 +167,16 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	/*
 	 * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
 	 * We just check MOVABLE pages.
+	 *
+	 * Pass the intersection of [start_pfn, end_pfn) and the page's pageblock
+	 * to avoid redundant checks.
 	 */
-	unmovable = has_unmovable_pages(zone, page, migratetype, isol_flags);
+	check_unmovable_start = max(page_to_pfn(page), start_pfn);
+	check_unmovable_end = min(ALIGN(page_to_pfn(page) + 1, pageblock_nr_pages),
+				  end_pfn);
+
+	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
+			migratetype, isol_flags);
 	if (!unmovable) {
 		unsigned long nr_pages;
 		int mt = get_pageblock_migratetype(page);
@@ -313,7 +333,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	     pfn < end_pfn;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
-		if (page && set_migratetype_isolate(page, migratetype, flags)) {
+		if (page && set_migratetype_isolate(page, migratetype, flags,
+					start_pfn, end_pfn)) {
 			undo_isolate_page_range(start_pfn, pfn, migratetype);
 			return -EBUSY;
 		}
-- 
2.35.1



* [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
  2022-04-25 14:31 ` [PATCH v11 1/6] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c Zi Yan
  2022-04-25 14:31 ` [PATCH v11 2/6] mm: page_isolation: check specified range for unmovable pages Zi Yan
@ 2022-04-25 14:31 ` Zi Yan
  2022-04-29 13:54   ` Zi Yan
  2022-04-25 14:31 ` [PATCH v11 4/6] mm: page_isolation: enable arbitrary range page isolation Zi Yan
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan, kernel test robot

From: Zi Yan <ziy@nvidia.com>

alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
merging pageblocks with different migratetypes. It might unnecessarily
convert extra pageblocks at the beginning and at the end of the range.
Change alloc_contig_range() to work at pageblock granularity.

Special handling is needed for free pages and in-use pages across the
boundaries of the range specified by alloc_contig_range(), because these
partially isolated pages cause free page accounting issues. The free
pages will be split and freed into separate migratetype lists; the
in-use pages will be migrated and the freed pages will then be handled in
the aforementioned way.
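
As a worked sketch of the split (assuming pageblock_order = 9, so a
MAX_ORDER-1 = 10 free page spans two pageblocks and the range boundary falls
at split_pfn_offset = 512; the numbers are illustrative only), the new
split_free_page() proceeds as:

	iteration 1: free_page_order = ffs(512) - 1 = 9, the first 512 pages
	             are freed as one order-9 buddy; split_pfn_offset reaches 0
	             and is reset to the remaining 512 pages
	iteration 2: free_page_order = ffs(512) - 1 = 9, the second 512 pages
	             are freed as another order-9 buddy

Each order-9 buddy then lands on the free list matching its own pageblock's
migratetype, which avoids the accounting issue described above.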

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page-isolation.h |   4 +-
 mm/internal.h                  |   6 ++
 mm/memory_hotplug.c            |   3 +-
 mm/page_alloc.c                |  54 ++++++++--
 mm/page_isolation.c            | 184 ++++++++++++++++++++++++++++++++-
 5 files changed, 233 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index e14eddf6741a..5456b7be38ae 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
  */
 int
 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			 unsigned migratetype, int flags);
+			 int migratetype, int flags, gfp_t gfp_flags);
 
 /*
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
@@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
  */
 void
 undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			unsigned migratetype);
+			int migratetype);
 
 /*
  * Test all pages in [start_pfn, end_pfn) are isolated or not.
diff --git a/mm/internal.h b/mm/internal.h
index 919fa07e1031..0667abd57634 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
 			  phys_addr_t min_addr,
 			  int nid, bool exact_nid);
 
+void split_free_page(struct page *free_page,
+				int order, unsigned long split_pfn_offset);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
@@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
 int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
+
+int __alloc_contig_migrate_range(struct compact_control *cc,
+					unsigned long start, unsigned long end);
 #endif
 int find_suitable_fallback(struct free_area *area, unsigned int order,
 			int migratetype, bool only_stealable, bool *can_steal);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4c6065e5d274..9f8ae4cb77ee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1845,7 +1845,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
-				       MEMORY_OFFLINE | REPORT_FAILURE);
+				       MEMORY_OFFLINE | REPORT_FAILURE,
+				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
 	if (ret) {
 		reason = "failure to isolate range";
 		goto failed_removal_pcplists_disabled;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce23ac8ad085..70ddd9a0bcf3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1094,6 +1094,43 @@ static inline void __free_one_page(struct page *page,
 		page_reporting_notify_free(order);
 }
 
+/**
+ * split_free_page() -- split a free page at split_pfn_offset
+ * @free_page:		the original free page
+ * @order:		the order of the page
+ * @split_pfn_offset:	split offset within the page
+ *
+ * It is used when the free page crosses two pageblocks with different migratetypes
+ * at split_pfn_offset within the page. The split free page will be put into
+ * separate migratetype lists afterwards. Otherwise, the function achieves
+ * nothing.
+ */
+void split_free_page(struct page *free_page,
+				int order, unsigned long split_pfn_offset)
+{
+	struct zone *zone = page_zone(free_page);
+	unsigned long free_page_pfn = page_to_pfn(free_page);
+	unsigned long pfn;
+	unsigned long flags;
+	int free_page_order;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	del_page_from_free_list(free_page, zone, order);
+	for (pfn = free_page_pfn;
+	     pfn < free_page_pfn + (1UL << order);) {
+		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
+
+		free_page_order = ffs(split_pfn_offset) - 1;
+		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
+				mt, FPI_NONE);
+		pfn += 1UL << free_page_order;
+		split_pfn_offset -= (1UL << free_page_order);
+		/* we have done the first part, now switch to second part */
+		if (split_pfn_offset == 0)
+			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -8919,7 +8956,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
 #endif
 
 /* [start, end) must belong to a single zone. */
-static int __alloc_contig_migrate_range(struct compact_control *cc,
+int __alloc_contig_migrate_range(struct compact_control *cc,
 					unsigned long start, unsigned long end)
 {
 	/* This function is based on compact_zone() from compaction.c. */
@@ -9002,7 +9039,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	unsigned int order;
+	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -9021,14 +9058,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * What we do here is we mark all pageblocks in range as
 	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
 	 * have different sizes, and due to the way page allocator
-	 * work, we align the range to biggest of the two pages so
-	 * that page allocator won't try to merge buddies from
-	 * different pageblocks and change MIGRATE_ISOLATE to some
-	 * other migration type.
+	 * work, start_isolate_page_range() has special handlings for this.
 	 *
 	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
 	 * migrate the pages from an unaligned range (ie. pages that
-	 * we are interested in).  This will put all the pages in
+	 * we are interested in). This will put all the pages in
 	 * range back to page allocator as MIGRATE_ISOLATE.
 	 *
 	 * When this is done, we take the pages in range from page
@@ -9042,9 +9076,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */
 
 	ret = start_isolate_page_range(pfn_max_align_down(start),
-				       pfn_max_align_up(end), migratetype, 0);
+				pfn_max_align_up(end), migratetype, 0, gfp_mask);
 	if (ret)
-		return ret;
+		goto done;
 
 	drain_all_pages(cc.zone);
 
@@ -9064,7 +9098,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	ret = 0;
 
 	/*
-	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
+	 * Pages from [start, end) are within a pageblock_nr_pages
 	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
 	 * more, all pages in [start, end) are free in page allocator.
 	 * What we are going to do is to allocate all pages from
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c2f7a8bb634d..94b3467e5ba2 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	return -EBUSY;
 }
 
-static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
+static void unset_migratetype_isolate(struct page *page, int migratetype)
 {
 	struct zone *zone;
 	unsigned long flags, nr_pages;
@@ -279,6 +279,157 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 	return NULL;
 }
 
+/**
+ * isolate_single_pageblock() -- tries to isolate a pageblock that might be
+ * within a free or in-use page.
+ * @boundary_pfn:		pageblock-aligned pfn that a page might cross
+ * @gfp_flags:			GFP flags used for migrating pages
+ * @isolate_before:	isolate the pageblock before the boundary_pfn
+ *
+ * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
+ * pageblock. When not all pageblocks within a page are isolated at the same
+ * time, free page accounting can go wrong. For example, in the case of
+ * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pageblocks.
+ * [         MAX_ORDER-1         ]
+ * [  pageblock0  |  pageblock1  ]
+ * When either pageblock is isolated, if it is a free page, the page is not
+ * split into separate migratetype lists as it is supposed to be; if it is an
+ * in-use page and freed later, __free_one_page() does not split the free page
+ * either. The function handles this by splitting the free page or migrating
+ * the in-use page then splitting the free page.
+ */
+static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
+			bool isolate_before)
+{
+	unsigned char saved_mt;
+	unsigned long start_pfn;
+	unsigned long isolate_pageblock;
+	unsigned long pfn;
+	struct zone *zone;
+
+	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
+
+	if (isolate_before)
+		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
+	else
+		isolate_pageblock = boundary_pfn;
+
+	/*
+	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
+	 * only isolating a subset of pageblocks from a bigger than pageblock
+	 * free or in-use page. Also make sure all to-be-isolated pageblocks
+	 * are within the same zone.
+	 */
+	zone  = page_zone(pfn_to_page(isolate_pageblock));
+	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
+				      zone->zone_start_pfn);
+
+	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
+	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
+
+	/*
+	 * Bail out early when the to-be-isolated pageblock does not form
+	 * a free or in-use page across boundary_pfn:
+	 *
+	 * 1. isolate before boundary_pfn: the page after is not online
+	 * 2. isolate after boundary_pfn: the page before is not online
+	 *
+	 * This also ensures correctness. Without it, when isolate after
+	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
+	 * __first_valid_page() will return unexpected NULL in the for loop
+	 * below.
+	 */
+	if (isolate_before) {
+		if (!pfn_to_online_page(boundary_pfn))
+			return 0;
+	} else {
+		if (!pfn_to_online_page(boundary_pfn - 1))
+			return 0;
+	}
+
+	for (pfn = start_pfn; pfn < boundary_pfn;) {
+		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
+
+		VM_BUG_ON(!page);
+		pfn = page_to_pfn(page);
+		/*
+		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
+		 * free pages in [start_pfn, boundary_pfn), its head page will
+		 * always be in the range.
+		 */
+		if (PageBuddy(page)) {
+			int order = buddy_order(page);
+
+			if (pfn + (1UL << order) > boundary_pfn)
+				split_free_page(page, order, boundary_pfn - pfn);
+			pfn += (1UL << order);
+			continue;
+		}
+		/*
+		 * migrate compound pages then let the free page handling code
+		 * above do the rest. If migration is not enabled, just fail.
+		 */
+		if (PageHuge(page) || PageTransCompound(page)) {
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+			unsigned long nr_pages = compound_nr(page);
+			int order = compound_order(page);
+			struct page *head = compound_head(page);
+			unsigned long head_pfn = page_to_pfn(head);
+			int ret;
+			struct compact_control cc = {
+				.nr_migratepages = 0,
+				.order = -1,
+				.zone = page_zone(pfn_to_page(head_pfn)),
+				.mode = MIGRATE_SYNC,
+				.ignore_skip_hint = true,
+				.no_set_skip_hint = true,
+				.gfp_mask = gfp_flags,
+				.alloc_contig = true,
+			};
+			INIT_LIST_HEAD(&cc.migratepages);
+
+			if (head_pfn + nr_pages < boundary_pfn) {
+				pfn += nr_pages;
+				continue;
+			}
+
+			ret = __alloc_contig_migrate_range(&cc, head_pfn,
+						head_pfn + nr_pages);
+
+			if (ret)
+				goto failed;
+			/*
+			 * reset pfn, let the free page handling code above
+			 * split the free page to the right migratetype list.
+			 *
+			 * head_pfn is not used here as a hugetlb page order
+			 * can be bigger than MAX_ORDER-1, but after it is
+			 * freed, the free page order is not. Use pfn within
+			 * the range to find the head of the free page and
+			 * reset order to 0 if a hugetlb page with
+			 * >MAX_ORDER-1 order is encountered.
+			 */
+			if (order > MAX_ORDER-1)
+				order = 0;
+			while (!PageBuddy(pfn_to_page(pfn))) {
+				order++;
+				pfn &= ~0UL << order;
+			}
+			continue;
+#else
+			goto failed;
+#endif
+		}
+
+		pfn++;
+	}
+	return 0;
+failed:
+	/* restore the original migratetype */
+	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
+	return -EBUSY;
+}
+
 /**
  * start_isolate_page_range() - make page-allocation-type of range of pages to
  * be MIGRATE_ISOLATE.
@@ -293,6 +444,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  *					 and PageOffline() pages.
  *			REPORT_FAILURE - report details about the failure to
  *			isolate the range
+ * @gfp_flags:		GFP flags used for migrating pages that sit across the
+ *			range boundaries.
  *
  * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
  * the range will never be allocated. Any free pages and pages freed in the
@@ -301,6 +454,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * pages in the range finally, the caller have to free all pages in the range.
  * test_page_isolated() can be used for test it.
  *
+ * The function first tries to isolate the pageblocks at the beginning and end
+ * of the range, since there might be pages across the range boundaries.
+ * Afterwards, it isolates the rest of the range.
+ *
  * There is no high level synchronization mechanism that prevents two threads
  * from trying to isolate overlapping ranges. If this happens, one thread
  * will notice pageblocks in the overlapping range already set to isolate.
@@ -321,21 +478,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			     unsigned migratetype, int flags)
+			     int migratetype, int flags, gfp_t gfp_flags)
 {
 	unsigned long pfn;
 	struct page *page;
+	int ret;
 
 	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
 	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
 
-	for (pfn = start_pfn;
-	     pfn < end_pfn;
+	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
+	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
+	if (ret)
+		return ret;
+
+	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
+	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
+	if (ret) {
+		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
+		return ret;
+	}
+
+	/* skip isolated pageblocks at the beginning and end */
+	for (pfn = start_pfn + pageblock_nr_pages;
+	     pfn < end_pfn - pageblock_nr_pages;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
 			undo_isolate_page_range(start_pfn, pfn, migratetype);
+			unset_migratetype_isolate(
+				pfn_to_page(end_pfn - pageblock_nr_pages),
+				migratetype);
 			return -EBUSY;
 		}
 	}
@@ -346,7 +520,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
  * Make isolated pages available again.
  */
 void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			    unsigned migratetype)
+			    int migratetype)
 {
 	unsigned long pfn;
 	struct page *page;
-- 
2.35.1



* [PATCH v11 4/6] mm: page_isolation: enable arbitrary range page isolation.
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
                   ` (2 preceding siblings ...)
  2022-04-25 14:31 ` [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity Zi Yan
@ 2022-04-25 14:31 ` Zi Yan
  2022-05-24 19:02   ` Zi Yan
  2022-04-25 14:31 ` [PATCH v11 5/6] mm: cma: use pageblock_order as the single alignment Zi Yan
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Now that start_isolate_page_range() is ready to handle arbitrary range
isolation, move the alignment check/adjustment into the function body.
Do the same for its counterpart undo_isolate_page_range().
alloc_contig_range(), the caller of both, can now pass an arbitrary range
instead of a MAX_ORDER_NR_PAGES aligned one.
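
The alignment that callers previously had to do now happens inside the
isolation functions (mirroring the hunk below):

	/* isolation is done at pageblock granularity */
	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);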

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c     | 16 ++--------------
 mm/page_isolation.c | 33 ++++++++++++++++-----------------
 2 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70ddd9a0bcf3..a002cf12eb6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8924,16 +8924,6 @@ void *__init alloc_large_system_hash(const char *tablename,
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static unsigned long pfn_max_align_down(unsigned long pfn)
-{
-	return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
-}
-
-static unsigned long pfn_max_align_up(unsigned long pfn)
-{
-	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
-}
-
 #if defined(CONFIG_DYNAMIC_DEBUG) || \
 	(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
 /* Usage: See admin-guide/dynamic-debug-howto.rst */
@@ -9075,8 +9065,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * put back to page allocator so that buddy can use them.
 	 */
 
-	ret = start_isolate_page_range(pfn_max_align_down(start),
-				pfn_max_align_up(end), migratetype, 0, gfp_mask);
+	ret = start_isolate_page_range(start, end, migratetype, 0, gfp_mask);
 	if (ret)
 		goto done;
 
@@ -9157,8 +9146,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		free_contig_range(end, outer_end - end);
 
 done:
-	undo_isolate_page_range(pfn_max_align_down(start),
-				pfn_max_align_up(end), migratetype);
+	undo_isolate_page_range(start, end, migratetype);
 	return ret;
 }
 EXPORT_SYMBOL(alloc_contig_range);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 94b3467e5ba2..75e454f5cf45 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -435,7 +435,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
  * be MIGRATE_ISOLATE.
  * @start_pfn:		The lower PFN of the range to be isolated.
  * @end_pfn:		The upper PFN of the range to be isolated.
- *			start_pfn/end_pfn must be aligned to pageblock_order.
  * @migratetype:	Migrate type to set in error recovery.
  * @flags:		The following flags are allowed (they can be combined in
  *			a bit mask)
@@ -482,33 +481,33 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
+	/* isolation is done at page block granularity */
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
+	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
 	int ret;
 
-	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
-	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
-
-	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
+	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
+	ret = isolate_single_pageblock(isolate_start, gfp_flags, false);
 	if (ret)
 		return ret;
 
-	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
-	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
+	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
+	ret = isolate_single_pageblock(isolate_end, gfp_flags, true);
 	if (ret) {
-		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
+		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
 		return ret;
 	}
 
 	/* skip isolated pageblocks at the beginning and end */
-	for (pfn = start_pfn + pageblock_nr_pages;
-	     pfn < end_pfn - pageblock_nr_pages;
+	for (pfn = isolate_start + pageblock_nr_pages;
+	     pfn < isolate_end - pageblock_nr_pages;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
-			undo_isolate_page_range(start_pfn, pfn, migratetype);
+			undo_isolate_page_range(isolate_start, pfn, migratetype);
 			unset_migratetype_isolate(
-				pfn_to_page(end_pfn - pageblock_nr_pages),
+				pfn_to_page(isolate_end - pageblock_nr_pages),
 				migratetype);
 			return -EBUSY;
 		}
@@ -524,12 +523,12 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
+	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
 
-	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
-	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
 
-	for (pfn = start_pfn;
-	     pfn < end_pfn;
+	for (pfn = isolate_start;
+	     pfn < isolate_end;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (!page || !is_migrate_isolate_page(page))
-- 
2.35.1



* [PATCH v11 5/6] mm: cma: use pageblock_order as the single alignment
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
                   ` (3 preceding siblings ...)
  2022-04-25 14:31 ` [PATCH v11 4/6] mm: page_isolation: enable arbitrary range page isolation Zi Yan
@ 2022-04-25 14:31 ` Zi Yan
  2022-04-25 14:31 ` [PATCH v11 6/6] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size Zi Yan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Now alloc_contig_range() works at pageblock granularity. Change CMA
allocation, which uses alloc_contig_range(), to use pageblock_nr_pages
alignment.
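
As a rough sizing sketch (assuming x86-64 defaults with 4KB pages,
MAX_ORDER = 11, and pageblock_order = 9 when hugepage support is enabled;
other configurations differ):

	before: CMA_MIN_ALIGNMENT_BYTES = PAGE_SIZE * MAX_ORDER_NR_PAGES
	                                = 4KB * 1024 = 4MB
	after:  CMA_MIN_ALIGNMENT_BYTES = PAGE_SIZE * pageblock_nr_pages
	                                = 4KB * 512  = 2MB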

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/cma.h    | 4 ++--
 include/linux/mmzone.h | 5 +----
 mm/page_alloc.c        | 4 ++--
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index a6f637342740..63873b93deaa 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -17,11 +17,11 @@
 #define CMA_MAX_NAME 64
 
 /*
- * TODO: once the buddy -- especially pageblock merging and alloc_contig_range()
+ * Now that the buddy -- especially pageblock merging and alloc_contig_range()
  * -- can deal with only some pageblocks of a higher-order page being
  *  MIGRATE_CMA, we can use pageblock_nr_pages.
  */
-#define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES
+#define CMA_MIN_ALIGNMENT_PAGES pageblock_nr_pages
 #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
 
 struct cma;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46ffab808f03..aab70355d64f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -54,10 +54,7 @@ enum migratetype {
 	 *
 	 * The way to use it is to change migratetype of a range of
 	 * pageblocks to MIGRATE_CMA which can be done by
-	 * __free_pageblock_cma() function.  What is important though
-	 * is that a range of pageblocks must be aligned to
-	 * MAX_ORDER_NR_PAGES should biggest page be bigger than
-	 * a single pageblock.
+	 * __free_pageblock_cma() function.
 	 */
 	MIGRATE_CMA,
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a002cf12eb6c..bc9e129ab3d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9014,8 +9014,8 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
  *			be either of the two.
  * @gfp_mask:	GFP mask to use during compaction
  *
- * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
- * aligned.  The PFN range must belong to a single zone.
+ * The PFN range does not have to be pageblock aligned. The PFN range must
+ * belong to a single zone.
  *
  * The first thing this routine does is attempt to MIGRATE_ISOLATE all
  * pageblocks in the range.  Once isolated, the pageblocks should not
-- 
2.35.1



* [PATCH v11 6/6] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size.
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
                   ` (4 preceding siblings ...)
  2022-04-25 14:31 ` [PATCH v11 5/6] mm: cma: use pageblock_order as the single alignment Zi Yan
@ 2022-04-25 14:31 ` Zi Yan
  2022-04-26 20:18 ` [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Qian Cai
  2022-05-10  1:03   ` Andrew Morton
  7 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-25 14:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, Zi Yan

From: Zi Yan <ziy@nvidia.com>

alloc_contig_range() now only needs pageblock_nr_pages alignment, so drop
the virtio_mem requirement that its minimum size be MAX_ORDER_NR_PAGES pages.
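
A quick sketch of how the new minimum interacts with the device block size
(the 1MB device_block_size is a made-up example value):

	sb_size = PAGE_SIZE * pageblock_nr_pages;	/* e.g. 2MB on x86-64 */
	sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
	/* with a hypothetical 1MB device block size, sb_size stays at the
	   pageblock size, where MAX_ORDER_NR_PAGES pages was the floor before */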

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 drivers/virtio/virtio_mem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index e7d6b679596d..e07486f01999 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2476,10 +2476,10 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 				      VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
 
 	/*
-	 * TODO: once alloc_contig_range() works reliably with pageblock
-	 * granularity on ZONE_NORMAL, use pageblock_nr_pages instead.
+	 * alloc_contig_range() works reliably with pageblock
+	 * granularity on ZONE_NORMAL, use pageblock_nr_pages.
 	 */
-	sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES;
+	sb_size = PAGE_SIZE * pageblock_nr_pages;
 	sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
 
 	if (sb_size < memory_block_size_bytes() && !force_bbm) {
-- 
2.35.1



* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
                   ` (5 preceding siblings ...)
  2022-04-25 14:31 ` [PATCH v11 6/6] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size Zi Yan
@ 2022-04-26 20:18 ` Qian Cai
  2022-04-26 20:26   ` Zi Yan
  2022-05-10  1:03   ` Andrew Morton
  7 siblings, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-04-26 20:18 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Mon, Apr 25, 2022 at 10:31:12AM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi David,
> 
> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-20-17-12.
> 
> Changelog
> ===
> V11
> ---
> 1. Moved start_isolate_page_range()/undo_isolate_page_range() alignment
>    change to a separate patch after the unmovable page check change and
>    alloc_contig_range() change to avoid some unwanted memory
>    hotplug/hotremove failures.
> 2. Cleaned up has_unmovable_pages() in Patch 2.
> 
> V10
> ---
> 1. Reverted back to the original outer_start, outer_end range for
>    test_pages_isolated() and isolate_freepages_range() in Patch 3,
>    otherwise isolation will fail if start in alloc_contig_range() is in
>    the middle of a free page.
> 
> V9
> ---
> 1. Limited has_unmovable_pages() check within a pageblock.
> 2. Added a check to ensure page isolation is done within a single zone
>    in isolate_single_pageblock().
> 3. Fixed an off-by-one bug in isolate_single_pageblock().
> 4. Fixed a NULL-deferencing bug when the pages before to-be-isolated pageblock
>    is not online in isolate_single_pageblock().
> 
> V8
> ---
> 1. Cleaned up has_unmovable_pages() to remove page argument.
> 
> V7
> ---
> 1. Added page validity check in isolate_single_pageblock() to avoid out
>    of zone pages.
> 2. Fixed a bug in split_free_page() to split and free pages in correct
>    page order.
> 
> V6
> ---
> 1. Resolved compilation error/warning reported by kernel test robot.
> 2. Tried to solve the coding concerns from Christophe Leroy.
> 3. Shortened lengthy lines (pointed out by Christoph Hellwig).
> 
> V5
> ---
> 1. Moved isolation address alignment handling in start_isolate_page_range().
> 2. Rewrote and simplified how alloc_contig_range() works at pageblock
>    granularity (Patch 3). Only two pageblock migratetypes need to be saved and
>    restored. start_isolate_page_range() might need to migrate pages in this
>    version, but it prevents the caller from worrying about
>    max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) alignment after the page range
>    is isolated.
> 
> V4
> ---
> 1. Dropped two irrelevant patches on non-lru compound page handling, as
>    it is not supported upstream.
> 2. Renamed migratetype_has_fallback() to migratetype_is_mergeable().
> 3. Always check whether two pageblocks can be merged in
>    __free_one_page() when order is >= pageblock_order, as the case (not
>    mergeable pageblocks are isolated, CMA, and HIGHATOMIC) becomes more common.
> 3. Moving has_unmovable_pages() is now a separate patch.
> 4. Removed MAX_ORDER-1 alignment requirement in the comment in virtio_mem code.
> 
> Description
> ===
> 
> The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range()
> isolates pageblocks to remove free memory from buddy allocator but isolating
> only a subset of pageblocks within a page spanning across multiple pageblocks
> causes free page accounting issues. Isolated page might not be put into the
> right free list, since the code assumes the migratetype of the first pageblock
> as the whole free page migratetype. This is based on the discussion at [2].
> 
> To remove the requirement, this patchset:
> 1. isolates pages at pageblock granularity instead of
>    max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages);
> 2. splits free pages across the specified range or migrates in-use pages
>    across the specified range then splits the freed page to avoid free page
>    accounting issues (it happens when multiple pageblocks within a single page
>    have different migratetypes);
> 3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
>    range during isolation to avoid alloc_contig_range() failure when pageblocks
>    within a MAX_ORDER - 1 aligned range are allocated separately.
> 4. returns pages not in the range as it did before.
> 
> One optimization might come later:
> 1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
>    migratetypes when isolation fails in the middle of the range.
> 
> Feel free to give comments and suggestions. Thanks.
> 
> [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
> [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/
> 
> Zi Yan (6):
>   mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
>   mm: page_isolation: check specified range for unmovable pages
>   mm: make alloc_contig_range work at pageblock granularity
>   mm: page_isolation: enable arbitrary range page isolation.
>   mm: cma: use pageblock_order as the single alignment
>   drivers: virtio_mem: use pageblock size as the minimum virtio_mem
>     size.
> 
>  drivers/virtio/virtio_mem.c    |   6 +-
>  include/linux/cma.h            |   4 +-
>  include/linux/mmzone.h         |   5 +-
>  include/linux/page-isolation.h |   6 +-
>  mm/internal.h                  |   6 +
>  mm/memory_hotplug.c            |   3 +-
>  mm/page_alloc.c                | 191 +++++-------------
>  mm/page_isolation.c            | 345 +++++++++++++++++++++++++++++++--
>  8 files changed, 392 insertions(+), 174 deletions(-)

Reverting this series fixed a deadlock during memory offline/online
tests, as well as a crash seen afterwards.

 INFO: task kmemleak:1027 blocked for more than 120 seconds.
       Not tainted 5.18.0-rc4-next-20220426-dirty #27
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kmemleak        state:D stack:27744 pid: 1027 ppid:     2 flags:0x00000008
 Call trace:
  __switch_to
  __schedule
  schedule
  percpu_rwsem_wait
  __percpu_down_read
  percpu_down_read.constprop.0
  get_online_mems
  kmemleak_scan
  kmemleak_scan_thread
  kthread
  ret_from_fork

 Showing all locks held in the system:
 1 lock held by rcu_tasks_kthre/11:
  #0: ffffc1e2cefc17f0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
 1 lock held by rcu_tasks_rude_/12:
  #0: ffffc1e2cefc1a90 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
 1 lock held by rcu_tasks_trace/13:
  #0: ffffc1e2cefc1db0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
 1 lock held by khungtaskd/824:
  #0: ffffc1e2cefc2820 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks
 2 locks held by kmemleak/1027:
  #0: ffffc1e2cf1aa628 (scan_mutex){+.+.}-{3:3}, at: kmemleak_scan_thread
  #1: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: get_online_mems
 2 locks held by cppc_fie/1805:
 1 lock held by in:imklog/2822:
 8 locks held by tee/3334:
  #0: ffff0816d65c9438 (sb_writers#6){.+.+}-{0:0}, at: vfs_write
  #1: ffff40025438be88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter
  #2: ffff4000c8261eb0 (kn->active#298){.+.+}-{0:0}, at: kernfs_fop_write_iter
  #3: ffffc1e2d0013f68 (device_hotplug_lock){+.+.}-{3:3}, at: online_store
  #4: ffff0800cd8bb998 (&dev->mutex){....}-{3:3}, at: device_offline
  #5: ffffc1e2ceed3750 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock
  #6: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: offline_pages
  #7: ffffc1e2cf13bf68 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable
 __zone_set_pageset_high_and_batch at mm/page_alloc.c:7005
 (inlined by) zone_pcp_disable at mm/page_alloc.c:9286

Later, running some kernel compilation workloads could trigger a crash.

 Unable to handle kernel paging request at virtual address fffffbfffe000030
 KASAN: maybe wild-memory-access in range [0x0003dffff0000180-0x0003dffff0000187]
 Mem abort info:
   ESR = 0x96000006
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x06: level 2 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000006
   CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000817545fd000
 [fffffbfffe000030] pgd=00000817581e9003, p4d=00000817581e9003, pud=00000817581ea003, pmd=0000000000000000
 Internal error: Oops: 96000006 [#1] PREEMPT SMP
 Modules linked in: bridge stp llc cdc_ether usbnet ipmi_devintf ipmi_msghandler cppc_cpufreq fuse ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod nouveau drm_ttm_helper ttm crct10dif_ce mlx5_core drm_display_helper drm_kms_helper nvme mpt3sas xhci_pci nvme_core drm raid_class xhci_pci_renesas
 CPU: 147 PID: 3334 Comm: tee Not tainted 5.18.0-rc4-next-20220426-dirty #27
 pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : isolate_single_pageblock
 lr : isolate_single_pageblock
 sp : ffff80003e767500
 x29: ffff80003e767500 x28: 0000000000000000 x27: ffff783c59963b1f
 x26: dfff800000000000 x25: ffffc1e2ccb1d000 x24: ffffc1e2ccb1d8f8
 x23: 00000000803bfe00 x22: ffffc1e2cee39098 x21: 0000000000000020
 x20: 00000000803c0000 x19: fffffbfffe000000 x18: ffffc1e2cee37d1c
 x17: 0000000000000000 x16: 1fffe8004a86f14c x15: 1fffe806c89e154a
 x14: 1fffe8004a86f11c x13: 0000000000000004 x12: ffff783c5c455e6d
 x11: 1ffff83c5c455e6c x10: ffff783c5c455e6c x9 : dfff800000000000
 x8 : ffffc1e2e22af363 x7 : 0000000000000001 x6 : 0000000000000003
 x5 : ffffc1e2e22af360 x4 : ffff783c5c455e6c x3 : ffff700007cece90
 x2 : 0000000000000003 x1 : 0000000000000000 x0 : fffffbfffe000030
 Call trace:
 Call trace:
  isolate_single_pageblock
  PageBuddy at ./include/linux/page-flags.h:969 (discriminator 3)
  (inlined by) isolate_single_pageblock at mm/page_isolation.c:414 (discriminator 3)
  start_isolate_page_range
  offline_pages
  memory_subsys_offline
  device_offline
  online_store
  dev_attr_store
  sysfs_kf_write
  kernfs_fop_write_iter
  new_sync_write
  vfs_write
  ksys_write
  __arm64_sys_write
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 Code: 38fa6821 7100003f 7a411041 54000dca (b9403260)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x41e2c0720000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x000,0021700d,19801c82
 Memory Limit: none

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-26 20:18 ` [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Qian Cai
@ 2022-04-26 20:26   ` Zi Yan
  2022-04-26 21:08     ` Qian Cai
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-04-26 20:26 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 10870 bytes --]

On 26 Apr 2022, at 16:18, Qian Cai wrote:

> On Mon, Apr 25, 2022 at 10:31:12AM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi David,
>>
>> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
>> and alloc_contig_range(). It prepares for my upcoming changes to make
>> MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-20-17-12.
>>
>> Changelog
>> ===
>> V11
>> ---
>> 1. Moved start_isolate_page_range()/undo_isolate_page_range() alignment
>>    change to a separate patch after the unmovable page check change and
>>    alloc_contig_range() change to avoid some unwanted memory
>>    hotplug/hotremove failures.
>> 2. Cleaned up has_unmovable_pages() in Patch 2.
>>
>> V10
>> ---
>> 1. Reverted back to the original outer_start, outer_end range for
>>    test_pages_isolated() and isolate_freepages_range() in Patch 3,
>>    otherwise isolation will fail if start in alloc_contig_range() is in
>>    the middle of a free page.
>>
>> V9
>> ---
>> 1. Limited has_unmovable_pages() check within a pageblock.
>> 2. Added a check to ensure page isolation is done within a single zone
>>    in isolate_single_pageblock().
>> 3. Fixed an off-by-one bug in isolate_single_pageblock().
>> 4. Fixed a NULL-deferencing bug when the pages before to-be-isolated pageblock
>>    is not online in isolate_single_pageblock().
>>
>> V8
>> ---
>> 1. Cleaned up has_unmovable_pages() to remove page argument.
>>
>> V7
>> ---
>> 1. Added page validity check in isolate_single_pageblock() to avoid out
>>    of zone pages.
>> 2. Fixed a bug in split_free_page() to split and free pages in correct
>>    page order.
>>
>> V6
>> ---
>> 1. Resolved compilation error/warning reported by kernel test robot.
>> 2. Tried to solve the coding concerns from Christophe Leroy.
>> 3. Shortened lengthy lines (pointed out by Christoph Hellwig).
>>
>> V5
>> ---
>> 1. Moved isolation address alignment handling in start_isolate_page_range().
>> 2. Rewrote and simplified how alloc_contig_range() works at pageblock
>>    granularity (Patch 3). Only two pageblock migratetypes need to be saved and
>>    restored. start_isolate_page_range() might need to migrate pages in this
>>    version, but it prevents the caller from worrying about
>>    max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) alignment after the page range
>>    is isolated.
>>
>> V4
>> ---
>> 1. Dropped two irrelevant patches on non-lru compound page handling, as
>>    it is not supported upstream.
>> 2. Renamed migratetype_has_fallback() to migratetype_is_mergeable().
>> 3. Always check whether two pageblocks can be merged in
>>    __free_one_page() when order is >= pageblock_order, as the case (not
>>    mergeable pageblocks are isolated, CMA, and HIGHATOMIC) becomes more common.
>> 3. Moving has_unmovable_pages() is now a separate patch.
>> 4. Removed MAX_ORDER-1 alignment requirement in the comment in virtio_mem code.
>>
>> Description
>> ===
>>
>> The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range()
>> isolates pageblocks to remove free memory from buddy allocator but isolating
>> only a subset of pageblocks within a page spanning across multiple pageblocks
>> causes free page accounting issues. Isolated page might not be put into the
>> right free list, since the code assumes the migratetype of the first pageblock
>> as the whole free page migratetype. This is based on the discussion at [2].
>>
>> To remove the requirement, this patchset:
>> 1. isolates pages at pageblock granularity instead of
>>    max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
>> 2. splits free pages across the specified range or migrates in-use pages
>>    across the specified range then splits the freed page to avoid free page
>>    accounting issues (it happens when multiple pageblocks within a single page
>>    have different migratetypes);
>> 3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
>>    range during isolation to avoid alloc_contig_range() failure when pageblocks
>>    within a MAX_ORDER - 1 aligned range are allocated separately.
>> 4. returns pages not in the range as it did before.
>>
>> One optimization might come later:
>> 1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
>>    migratetypes when isolation fails in the middle of the range.
>>
>> Feel free to give comments and suggestions. Thanks.
>>
>> [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
>> [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/
>>
>> Zi Yan (6):
>>   mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
>>   mm: page_isolation: check specified range for unmovable pages
>>   mm: make alloc_contig_range work at pageblock granularity
>>   mm: page_isolation: enable arbitrary range page isolation.
>>   mm: cma: use pageblock_order as the single alignment
>>   drivers: virtio_mem: use pageblock size as the minimum virtio_mem
>>     size.
>>
>>  drivers/virtio/virtio_mem.c    |   6 +-
>>  include/linux/cma.h            |   4 +-
>>  include/linux/mmzone.h         |   5 +-
>>  include/linux/page-isolation.h |   6 +-
>>  mm/internal.h                  |   6 +
>>  mm/memory_hotplug.c            |   3 +-
>>  mm/page_alloc.c                | 191 +++++-------------
>>  mm/page_isolation.c            | 345 +++++++++++++++++++++++++++++++--
>>  8 files changed, 392 insertions(+), 174 deletions(-)
>
> Reverting this series fixed a deadlock during memory offline/online
> tests, as well as a later crash.

Hi Qian,

Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?

>
>  INFO: task kmemleak:1027 blocked for more than 120 seconds.
>        Not tainted 5.18.0-rc4-next-20220426-dirty #27
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kmemleak        state:D stack:27744 pid: 1027 ppid:     2 flags:0x00000008
>  Call trace:
>   __switch_to
>   __schedule
>   schedule
>   percpu_rwsem_wait
>   __percpu_down_read
>   percpu_down_read.constprop.0
>   get_online_mems
>   kmemleak_scan
>   kmemleak_scan_thread
>   kthread
>   ret_from_fork
>
>  Showing all locks held in the system:
>  1 lock held by rcu_tasks_kthre/11:
>   #0: ffffc1e2cefc17f0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
>  1 lock held by rcu_tasks_rude_/12:
>   #0: ffffc1e2cefc1a90 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
>  1 lock held by rcu_tasks_trace/13:
>   #0: ffffc1e2cefc1db0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
>  1 lock held by khungtaskd/824:
>   #0: ffffc1e2cefc2820 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks
>  2 locks held by kmemleak/1027:
>   #0: ffffc1e2cf1aa628 (scan_mutex){+.+.}-{3:3}, at: kmemleak_scan_thread
>   #1: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: get_online_mems
>  2 locks held by cppc_fie/1805:
>  1 lock held by in:imklog/2822:
>  8 locks held by tee/3334:
>   #0: ffff0816d65c9438 (sb_writers#6){.+.+}-{0:0}, at: vfs_write
>   #1: ffff40025438be88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter
>   #2: ffff4000c8261eb0 (kn->active#298){.+.+}-{0:0}, at: kernfs_fop_write_iter
>   #3: ffffc1e2d0013f68 (device_hotplug_lock){+.+.}-{3:3}, at: online_store
>   #4: ffff0800cd8bb998 (&dev->mutex){....}-{3:3}, at: device_offline
>   #5: ffffc1e2ceed3750 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock
>   #6: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: offline_pages
>   #7: ffffc1e2cf13bf68 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable
>  __zone_set_pageset_high_and_batch at mm/page_alloc.c:7005
>  (inlined by) zone_pcp_disable at mm/page_alloc.c:9286
>
> Later, running some kernel compilation workloads could trigger a crash.
>
>  Unable to handle kernel paging request at virtual address fffffbfffe000030
>  KASAN: maybe wild-memory-access in range [0x0003dffff0000180-0x0003dffff0000187]
>  Mem abort info:
>    ESR = 0x96000006
>    EC = 0x25: DABT (current EL), IL = 32 bits
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>    FSC = 0x06: level 2 translation fault
>  Data abort info:
>    ISV = 0, ISS = 0x00000006
>    CM = 0, WnR = 0
>  swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000817545fd000
>  [fffffbfffe000030] pgd=00000817581e9003, p4d=00000817581e9003, pud=00000817581ea003, pmd=0000000000000000
>  Internal error: Oops: 96000006 [#1] PREEMPT SMP
>  Modules linked in: bridge stp llc cdc_ether usbnet ipmi_devintf ipmi_msghandler cppc_cpufreq fuse ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod nouveau drm_ttm_helper ttm crct10dif_ce mlx5_core drm_display_helper drm_kms_helper nvme mpt3sas xhci_pci nvme_core drm raid_class xhci_pci_renesas
>  CPU: 147 PID: 3334 Comm: tee Not tainted 5.18.0-rc4-next-20220426-dirty #27
>  pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : isolate_single_pageblock
>  lr : isolate_single_pageblock
>  sp : ffff80003e767500
>  x29: ffff80003e767500 x28: 0000000000000000 x27: ffff783c59963b1f
>  x26: dfff800000000000 x25: ffffc1e2ccb1d000 x24: ffffc1e2ccb1d8f8
>  x23: 00000000803bfe00 x22: ffffc1e2cee39098 x21: 0000000000000020
>  x20: 00000000803c0000 x19: fffffbfffe000000 x18: ffffc1e2cee37d1c
>  x17: 0000000000000000 x16: 1fffe8004a86f14c x15: 1fffe806c89e154a
>  x14: 1fffe8004a86f11c x13: 0000000000000004 x12: ffff783c5c455e6d
>  x11: 1ffff83c5c455e6c x10: ffff783c5c455e6c x9 : dfff800000000000
>  x8 : ffffc1e2e22af363 x7 : 0000000000000001 x6 : 0000000000000003
>  x5 : ffffc1e2e22af360 x4 : ffff783c5c455e6c x3 : ffff700007cece90
>  x2 : 0000000000000003 x1 : 0000000000000000 x0 : fffffbfffe000030
>  Call trace:
>  Call trace:
>   isolate_single_pageblock
>   PageBuddy at ./include/linux/page-flags.h:969 (discriminator 3)
>   (inlined by) isolate_single_pageblock at mm/page_isolation.c:414 (discriminator 3)
>   start_isolate_page_range
>   offline_pages
>   memory_subsys_offline
>   device_offline
>   online_store
>   dev_attr_store
>   sysfs_kf_write
>   kernfs_fop_write_iter
>   new_sync_write
>   vfs_write
>   ksys_write
>   __arm64_sys_write
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
>  Code: 38fa6821 7100003f 7a411041 54000dca (b9403260)
>  ---[ end trace 0000000000000000 ]---
>  Kernel panic - not syncing: Oops: Fatal exception
>  SMP: stopping secondary CPUs
>  Kernel Offset: 0x41e2c0720000 from 0xffff800008000000
>  PHYS_OFFSET: 0x80000000
>  CPU features: 0x000,0021700d,19801c82
>  Memory Limit: none

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-26 20:26   ` Zi Yan
@ 2022-04-26 21:08     ` Qian Cai
  2022-04-26 21:38       ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-04-26 21:08 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Tue, Apr 26, 2022 at 04:26:08PM -0400, Zi Yan wrote:
> Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?

Nothing fancy. It just tries to remove and add back each memory section.

#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0

import os
import re
import subprocess


def mem_iter():
    base_dir = '/sys/devices/system/memory/'
    for curr_dir in os.listdir(base_dir):
        if re.match(r'memory\d+', curr_dir):
            yield base_dir + curr_dir


if __name__ == '__main__':
    print('- Try to remove each memory section and then add it back.')
    for mem_dir in mem_iter():
        status = f'{mem_dir}/online'
        if open(status).read().rstrip() == '1':
            # This could expectedly fail due to many reasons.
            section = os.path.basename(mem_dir)
            print(f'- Try to remove {section}.')
            proc = subprocess.run([f'echo 0 | sudo tee {status}'], shell=True)
            if proc.returncode == 0:
                print(f'- Try to add {section}.')
                subprocess.check_call([f'echo 1 | sudo tee {status}'], shell=True)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-26 21:08     ` Qian Cai
@ 2022-04-26 21:38       ` Zi Yan
  2022-04-27 12:41         ` Qian Cai
                           ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-26 21:38 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1492 bytes --]

On 26 Apr 2022, at 17:08, Qian Cai wrote:

> On Tue, Apr 26, 2022 at 04:26:08PM -0400, Zi Yan wrote:
>> Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?
>
> Nothing fancy. It just tries to remove and add back each memory section.
>
> #!/usr/bin/env python3
> # SPDX-License-Identifier: GPL-2.0
>
> import os
> import re
> import subprocess
>
>
> def mem_iter():
>     base_dir = '/sys/devices/system/memory/'
>     for curr_dir in os.listdir(base_dir):
>         if re.match(r'memory\d+', curr_dir):
>             yield base_dir + curr_dir
>
>
> if __name__ == '__main__':
>     print('- Try to remove each memory section and then add it back.')
>     for mem_dir in mem_iter():
>         status = f'{mem_dir}/online'
>         if open(status).read().rstrip() == '1':
>             # This could expectedly fail due to many reasons.
>             section = os.path.basename(mem_dir)
>             print(f'- Try to remove {section}.')
>             proc = subprocess.run([f'echo 0 | sudo tee {status}'], shell=True)
>             if proc.returncode == 0:
>                 print(f'- Try to add {section}.')
>                 subprocess.check_call([f'echo 1 | sudo tee {status}'], shell=True)

Thanks. Do you mind attaching your config file? I cannot reproduce
the deadlock locally using my own config. I also see kmemleak_scan
in the dumped stack, so it must be something else in addition to
memory online/offline causing the issue.

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-26 21:38       ` Zi Yan
@ 2022-04-27 12:41         ` Qian Cai
  2022-04-27 13:10         ` Qian Cai
  2022-04-27 13:27         ` Qian Cai
  2 siblings, 0 replies; 44+ messages in thread
From: Qian Cai @ 2022-04-27 12:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
> Thanks. Do you mind attaching your config file? I cannot reproduce
> the deadlock locally using my own config. I also see kmemleak_scan
> in the dumped stack, so it must be something else in addition to
> memory online/offline causing the issue.

This is on an arm64 server.

$ make defconfig debug.config

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-26 21:38       ` Zi Yan
  2022-04-27 12:41         ` Qian Cai
@ 2022-04-27 13:10         ` Qian Cai
  2022-04-27 13:27         ` Qian Cai
  2 siblings, 0 replies; 44+ messages in thread
From: Qian Cai @ 2022-04-27 13:10 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
> Thanks. Do you mind attaching your config file? I cannot reproduce
> the deadlock locally using my own config. I also see kmemleak_scan
> in the dumped stack, so it must be something else in addition to
> memory online/offline causing the issue.

Of course, you also need to enable these options. The kmemleak_scan is just a
symptom of one of the online operations blocking forever, as the
locks were never released.

CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
CONFIG_MEMORY_HOTREMOVE=y

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-26 21:38       ` Zi Yan
  2022-04-27 12:41         ` Qian Cai
  2022-04-27 13:10         ` Qian Cai
@ 2022-04-27 13:27         ` Qian Cai
  2022-04-27 13:30           ` Zi Yan
  2 siblings, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-04-27 13:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
> Thanks. Do you mind attaching your config file? I cannot reproduce
> the deadlock locally using my own config. I also see kmemleak_scan
> in the dumped stack, so it must be something else in addition to
> memory online/offline causing the issue.

Actually, it is one of those *offline* operations, i.e.,

echo 0 > /sys/devices/system/memory/memoryNNN/online

looping forever and never finishing even after more than 2 hours.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-27 13:27         ` Qian Cai
@ 2022-04-27 13:30           ` Zi Yan
  2022-04-27 21:04             ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-04-27 13:30 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

On 27 Apr 2022, at 9:27, Qian Cai wrote:

> On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
>> Thanks. Do you mind attaching your config file? I cannot reproduce
>> the deadlock locally using my own config. I also see kmemleak_scan
>> in the dumped stack, so it must be something else in addition to
>> memory online/offline causing the issue.
>
> Actually, it is one of those *offline* operations, i.e.,
>
> echo 0 > /sys/devices/system/memory/memoryNNN/online
>
> looping forever and never finishing even after more than 2 hours.

Thank you for the detailed information. I am able to reproduce the
issue locally. I will update the patch once I fix the bug.

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-27 13:30           ` Zi Yan
@ 2022-04-27 21:04             ` Zi Yan
  2022-04-28 12:33               ` Qian Cai
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-04-27 21:04 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 7038 bytes --]

On 27 Apr 2022, at 9:30, Zi Yan wrote:

> On 27 Apr 2022, at 9:27, Qian Cai wrote:
>
>> On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
>>> Thanks. Do you mind attaching your config file? I cannot reproduce
>>> the deadlock locally using my own config. I also see kmemleak_scan
>>> in the dumped stack, so it must be something else in addition to
>>> memory online/offline causing the issue.
>>
>> Actually, it is one of those *offline* operations, i.e.,
>>
>> echo 0 > /sys/devices/system/memory/memoryNNN/online
>>
>> looping forever and never finishing even after more than 2 hours.
>
> Thank you for the detailed information. I am able to reproduce the
> issue locally. I will update the patch once I fix the bug.

Hi Qian,

Do you mind checking if the patch below fixes the issue? It works
for me.

The original code was trying to migrate non-migratable compound pages
(high-order slab pages from my tests) during isolation and caused
an infinite loop. The patch avoids non-migratable pages.
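
In other words, a minimal sketch of the check the diff below adds (the
sketch is only illustrative, not the patch itself):

	/*
	 * Only compound pages that migration can actually move are
	 * attempted; anything else (e.g. a high-order slab page) fails
	 * the isolation instead of being retried forever.
	 */
	if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
		/* migrate, then let the free page handling above split the result */
	} else {
		goto failed;
	}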

I will update my patch series once we confirm the patch fixes
the bug.

Thanks.

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 75e454f5cf45..c39980fce626 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -367,58 +367,68 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                }
                /*
                 * migrate compound pages then let the free page handling code
-                * above do the rest. If migration is not enabled, just fail.
+                * above do the rest. If migration is not possible, just fail.
                 */
-               if (PageHuge(page) || PageTransCompound(page)) {
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+               if (PageCompound(page)) {
                        unsigned long nr_pages = compound_nr(page);
-                       int order = compound_order(page);
                        struct page *head = compound_head(page);
                        unsigned long head_pfn = page_to_pfn(head);
-                       int ret;
-                       struct compact_control cc = {
-                               .nr_migratepages = 0,
-                               .order = -1,
-                               .zone = page_zone(pfn_to_page(head_pfn)),
-                               .mode = MIGRATE_SYNC,
-                               .ignore_skip_hint = true,
-                               .no_set_skip_hint = true,
-                               .gfp_mask = gfp_flags,
-                               .alloc_contig = true,
-                       };
-                       INIT_LIST_HEAD(&cc.migratepages);

                        if (head_pfn + nr_pages < boundary_pfn) {
-                               pfn += nr_pages;
+                               pfn = head_pfn + nr_pages;
                                continue;
                        }

-                       ret = __alloc_contig_migrate_range(&cc, head_pfn,
-                                               head_pfn + nr_pages);
-
-                       if (ret)
-                               goto failed;
+#if defined CONFIG_MIGRATION
                        /*
-                        * reset pfn, let the free page handling code above
-                        * split the free page to the right migratetype list.
-                        *
-                        * head_pfn is not used here as a hugetlb page order
-                        * can be bigger than MAX_ORDER-1, but after it is
-                        * freed, the free page order is not. Use pfn within
-                        * the range to find the head of the free page and
-                        * reset order to 0 if a hugetlb page with
-                        * >MAX_ORDER-1 order is encountered.
+                        * hugetlb, lru compound (THP), and movable compound pages
+                        * can be migrated. Otherwise, fail the isolation.
                         */
-                       if (order > MAX_ORDER-1)
+                       if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+                               int order;
+                               unsigned long outer_pfn;
+                               int ret;
+                               struct compact_control cc = {
+                                       .nr_migratepages = 0,
+                                       .order = -1,
+                                       .zone = page_zone(pfn_to_page(head_pfn)),
+                                       .mode = MIGRATE_SYNC,
+                                       .ignore_skip_hint = true,
+                                       .no_set_skip_hint = true,
+                                       .gfp_mask = gfp_flags,
+                                       .alloc_contig = true,
+                               };
+                               INIT_LIST_HEAD(&cc.migratepages);
+
+                               ret = __alloc_contig_migrate_range(&cc, head_pfn,
+                                                       head_pfn + nr_pages);
+
+                               if (ret)
+                                       goto failed;
+                               /*
+                                * reset pfn to the head of the free page, so
+                                * that the free page handling code above can split
+                                * the free page to the right migratetype list.
+                                *
+                                * head_pfn is not used here as a hugetlb page order
+                                * can be bigger than MAX_ORDER-1, but after it is
+                                * freed, the free page order is not. Use pfn within
+                                * the range to find the head of the free page.
+                                */
                                order = 0;
-                       while (!PageBuddy(pfn_to_page(pfn))) {
-                               order++;
-                               pfn &= ~0UL << order;
-                       }
-                       continue;
-#else
-                       goto failed;
+                               outer_pfn = pfn;
+                               while (!PageBuddy(pfn_to_page(outer_pfn))) {
+                                       if (++order >= MAX_ORDER) {
+                                               outer_pfn = pfn;
+                                               break;
+                                       }
+                                       outer_pfn &= ~0UL << order;
+                               }
+                               pfn = outer_pfn;
+                               continue;
+                       } else
 #endif
+                               goto failed;
                }

                pfn++;
--
Best Regards,
Yan, Zi

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-27 21:04             ` Zi Yan
@ 2022-04-28 12:33               ` Qian Cai
  2022-04-28 12:39                 ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-04-28 12:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Wed, Apr 27, 2022 at 05:04:39PM -0400, Zi Yan wrote:
> Do you mind checking if the patch below fixes the issue? It works
> for me.
> 
> The original code was trying to migrate non-migratable compound pages
> (high-order slab pages from my tests) during isolation and caused
> an infinite loop. The patch avoids non-migratable pages.
> 
> I will update my patch series once we confirm the patch fixes
> the bug.

I am not able to apply it on today's linux-next tree.

$ patch -Np1 --dry-run < ../patch/migrate.patch
checking file mm/page_isolation.c
Hunk #1 FAILED at 367.
1 out of 1 hunk FAILED

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-28 12:33               ` Qian Cai
@ 2022-04-28 12:39                 ` Zi Yan
  2022-04-28 16:19                   ` Qian Cai
  2022-05-19 20:57                   ` Qian Cai
  0 siblings, 2 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-28 12:39 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton


[-- Attachment #1.1: Type: text/plain, Size: 807 bytes --]

On 28 Apr 2022, at 8:33, Qian Cai wrote:

> On Wed, Apr 27, 2022 at 05:04:39PM -0400, Zi Yan wrote:
>> Do you mind checking if the patch below fixes the issue? It works
>> for me.
>>
>> The original code was trying to migrate non-migratable compound pages
>> (high-order slab pages from my tests) during isolation and caused
>> an infinite loop. The patch avoids non-migratable pages.
>>
>> I will update my patch series once we confirm the patch fixes
>> the bug.
>
> I am not able to apply it on today's linux-next tree.
>
> $ patch -Np1 --dry-run < ../patch/migrate.patch
> checking file mm/page_isolation.c
> Hunk #1 FAILED at 367.
> 1 out of 1 hunk FAILED

How about the one attached? I can apply it to next-20220428. Let me know
if you are using a different branch. Thanks.


--
Best Regards,
Yan, Zi

[-- Attachment #1.2: 0001-fix-what-can-be-migrated-what-cannot.patch --]
[-- Type: text/plain, Size: 3885 bytes --]

From 1567f4dbc287f6fe2fa6d4dc63fa1f9137692cff Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Wed, 27 Apr 2022 16:49:22 -0400
Subject: [PATCH] fix what can be migrated what cannot.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_isolation.c | 88 +++++++++++++++++++++++++--------------------
 1 file changed, 49 insertions(+), 39 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 75e454f5cf45..7968a1dd692a 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -367,58 +367,68 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 		}
 		/*
 		 * migrate compound pages then let the free page handling code
-		 * above do the rest. If migration is not enabled, just fail.
+		 * above do the rest. If migration is not possible, just fail.
 		 */
-		if (PageHuge(page) || PageTransCompound(page)) {
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+		if (PageCompound(page)) {
 			unsigned long nr_pages = compound_nr(page);
-			int order = compound_order(page);
 			struct page *head = compound_head(page);
 			unsigned long head_pfn = page_to_pfn(head);
-			int ret;
-			struct compact_control cc = {
-				.nr_migratepages = 0,
-				.order = -1,
-				.zone = page_zone(pfn_to_page(head_pfn)),
-				.mode = MIGRATE_SYNC,
-				.ignore_skip_hint = true,
-				.no_set_skip_hint = true,
-				.gfp_mask = gfp_flags,
-				.alloc_contig = true,
-			};
-			INIT_LIST_HEAD(&cc.migratepages);
 
 			if (head_pfn + nr_pages < boundary_pfn) {
-				pfn += nr_pages;
+				pfn = head_pfn + nr_pages;
 				continue;
 			}
 
-			ret = __alloc_contig_migrate_range(&cc, head_pfn,
-						head_pfn + nr_pages);
-
-			if (ret)
-				goto failed;
+#if defined CONFIG_MIGRATION
 			/*
-			 * reset pfn, let the free page handling code above
-			 * split the free page to the right migratetype list.
-			 *
-			 * head_pfn is not used here as a hugetlb page order
-			 * can be bigger than MAX_ORDER-1, but after it is
-			 * freed, the free page order is not. Use pfn within
-			 * the range to find the head of the free page and
-			 * reset order to 0 if a hugetlb page with
-			 * >MAX_ORDER-1 order is encountered.
+			 * hugetlb, lru compound (THP), and movable compound pages
+			 * can be migrated. Otherwise, fail the isolation.
 			 */
-			if (order > MAX_ORDER-1)
+			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+				int order;
+				unsigned long outer_pfn;
+				int ret;
+				struct compact_control cc = {
+					.nr_migratepages = 0,
+					.order = -1,
+					.zone = page_zone(pfn_to_page(head_pfn)),
+					.mode = MIGRATE_SYNC,
+					.ignore_skip_hint = true,
+					.no_set_skip_hint = true,
+					.gfp_mask = gfp_flags,
+					.alloc_contig = true,
+				};
+				INIT_LIST_HEAD(&cc.migratepages);
+
+				ret = __alloc_contig_migrate_range(&cc, head_pfn,
+							head_pfn + nr_pages);
+
+				if (ret)
+					goto failed;
+				/*
+				 * reset pfn to the head of the free page, so
+				 * that the free page handling code above can split
+				 * the free page to the right migratetype list.
+				 *
+				 * head_pfn is not used here as a hugetlb page order
+				 * can be bigger than MAX_ORDER-1, but after it is
+				 * freed, the free page order is not. Use pfn within
+				 * the range to find the head of the free page.
+				 */
 				order = 0;
-			while (!PageBuddy(pfn_to_page(pfn))) {
-				order++;
-				pfn &= ~0UL << order;
-			}
-			continue;
-#else
-			goto failed;
+				outer_pfn = pfn;
+				while (!PageBuddy(pfn_to_page(outer_pfn))) {
+					if (++order >= MAX_ORDER) {
+						outer_pfn = pfn;
+						break;
+					}
+					outer_pfn &= ~0UL << order;
+				}
+				pfn = outer_pfn;
+				continue;
+			} else
 #endif
+				goto failed;
 		}
 
 		pfn++;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-28 12:39                 ` Zi Yan
@ 2022-04-28 16:19                   ` Qian Cai
  2022-04-29 13:38                     ` Zi Yan
  2022-05-19 20:57                   ` Qian Cai
  1 sibling, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-04-28 16:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
> How about the one attached? I can apply it to next-20220428. Let me know
> if you are using a different branch. Thanks.

The original endless loop is gone, but running some syscall fuzzer
afterwards for a while would trigger the warning here. I have yet to
figure out if this is related to this series.

        /*
         * There are several places where we assume that the order value is sane
         * so bail out early if the request is out of bound.
         */
        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
                return NULL;
        }

 WARNING: CPU: 26 PID: 172874 at mm/page_alloc.c:5368 __alloc_pages
 CPU: 26 PID: 172874 Comm: trinity-main Not tainted 5.18.0-rc4-next-20220428-dirty #67
 pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 tpidr_el2 : ffff28cf80a61000
 pc : __alloc_pages
 lr : alloc_pages
 sp : ffff8000597b70f0
 x29: ffff8000597b70f0 x28: ffff0801e68d34c0 x27: 0000000000000000
 x26: 1ffff0000b2f6ea2 x25: ffff8000597b7510 x24: 0000000000000dc0
 x23: ffff28cf80a61000 x22: 000000000000000e x21: 1ffff0000b2f6e28
 x20: 0000000000040dc0 x19: ffffdf670d4a6fe0 x18: ffffdf66fa017d1c
 x17: ffffdf66f42f8348 x16: 1fffe1003cd1a7b3 x15: 000000000000001a
 x14: 1fffe1003cd1a7a6 x13: 0000000000000004 x12: ffff70000b2f6e05
 x11: 1ffff0000b2f6e04 x10: 00000000f204f1f1 x9 : 000000000000f204
 x8 : dfff800000000000 x7 : 00000000f3000000 x6 : 00000000f3f3f3f3
 x5 : ffff70000b2f6e28 x4 : ffff0801e68d34c0 x3 : 0000000000000000
 x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
 Call trace:
  __alloc_pages
  alloc_pages
  kmalloc_order
  kmalloc_order_trace
  __kmalloc
  __regset_get
  regset_get_alloc
  fill_thread_core_info
  fill_note_info
  elf_core_dump
  do_coredump
  get_signal
  do_signal
  do_notify_resume
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 irq event stamp: 3614
 hardirqs last  enabled at (3613):  _raw_spin_unlock_irqrestore
 hardirqs last disabled at (3614):  el1_dbg
 softirqs last  enabled at (2988):  fpsimd_preserve_current_state
 softirqs last disabled at (2986):  fpsimd_preserve_current_state

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-28 16:19                   ` Qian Cai
@ 2022-04-29 13:38                     ` Zi Yan
  0 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-29 13:38 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2946 bytes --]

On 28 Apr 2022, at 12:19, Qian Cai wrote:

> On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
>> How about the one attached? I can apply it to next-20220428. Let me know
>> if you are using a different branch. Thanks.
>
> The original endless loop is gone, but running some syscall fuzzer

Thanks for the confirmation.

> afterwards for a while would trigger the warning here. I have yet to
> figure out if this is related to this series.
>
>         /*
>          * There are several places where we assume that the order value is sane
>          * so bail out early if the request is out of bound.
>          */
>         if (unlikely(order >= MAX_ORDER)) {
>                 WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
>                 return NULL;
>         }
>
>  WARNING: CPU: 26 PID: 172874 at mm/page_alloc.c:5368 __alloc_pages
>  CPU: 26 PID: 172874 Comm: trinity-main Not tainted 5.18.0-rc4-next-20220428-dirty #67
>  pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  tpidr_el2 : ffff28cf80a61000
>  pc : __alloc_pages
>  lr : alloc_pages
>  sp : ffff8000597b70f0
>  x29: ffff8000597b70f0 x28: ffff0801e68d34c0 x27: 0000000000000000
>  x26: 1ffff0000b2f6ea2 x25: ffff8000597b7510 x24: 0000000000000dc0
>  x23: ffff28cf80a61000 x22: 000000000000000e x21: 1ffff0000b2f6e28
>  x20: 0000000000040dc0 x19: ffffdf670d4a6fe0 x18: ffffdf66fa017d1c
>  x17: ffffdf66f42f8348 x16: 1fffe1003cd1a7b3 x15: 000000000000001a
>  x14: 1fffe1003cd1a7a6 x13: 0000000000000004 x12: ffff70000b2f6e05
>  x11: 1ffff0000b2f6e04 x10: 00000000f204f1f1 x9 : 000000000000f204
>  x8 : dfff800000000000 x7 : 00000000f3000000 x6 : 00000000f3f3f3f3
>  x5 : ffff70000b2f6e28 x4 : ffff0801e68d34c0 x3 : 0000000000000000
>  x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
>  Call trace:
>   __alloc_pages
>   alloc_pages
>   kmalloc_order
>   kmalloc_order_trace
>   __kmalloc
>   __regset_get
>   regset_get_alloc
>   fill_thread_core_info
>   fill_note_info
>   elf_core_dump
>   do_coredump
>   get_signal
>   do_signal
>   do_notify_resume
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
>  irq event stamp: 3614
>  hardirqs last  enabled at (3613):  _raw_spin_unlock_irqrestore
>  hardirqs last disabled at (3614):  el1_dbg
>  softirqs last  enabled at (2988):  fpsimd_preserve_current_state
>  softirqs last disabled at (2986):  fpsimd_preserve_current_state

I got an email this morning reporting a warning with the same call trace:
https://lore.kernel.org/linux-mm/CA+G9fYveMF-NU-rvrsbaora2g2QWxrkF7AWViuDrJyN9mNScJg@mail.gmail.com/

The email says the warning first appeared in next-20220427, but my
patchset has been in linux-next since next-20220426. In addition,
my patches do not touch any function in the call trace. I assume
this warning is not related to my patchset, but let me know
if it turns out to be.
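
For reference, a rough bound (assuming the usual arm64 defaults of 4KB
pages and MAX_ORDER = 11): a large __kmalloc() ends up in kmalloc_order(),
which allocates a single page of order get_order(size), and __alloc_pages()
warns as soon as order >= MAX_ORDER, i.e. for any single allocation above
PAGE_SIZE << (MAX_ORDER - 1) = 4MB. An oversized regset allocation from the
coredump path would hit that check with or without the page isolation
changes.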

Thanks.

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-04-25 14:31 ` [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity Zi Yan
@ 2022-04-29 13:54   ` Zi Yan
  2022-05-24 19:00     ` Zi Yan
  2022-05-25 17:41     ` Doug Berger
  0 siblings, 2 replies; 44+ messages in thread
From: Zi Yan @ 2022-04-29 13:54 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, kernel test robot, Qian Cai

[-- Attachment #1: Type: text/plain, Size: 37990 bytes --]

On 25 Apr 2022, at 10:31, Zi Yan wrote:

> From: Zi Yan <ziy@nvidia.com>
>
> alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
> merging pageblocks with different migratetypes. It might unnecessarily
> convert extra pageblocks at the beginning and at the end of the range.
> Change alloc_contig_range() to work at pageblock granularity.
>
> Special handling is needed for free pages and in-use pages across the
> boundaries of the range specified by alloc_contig_range(). Because these
> partially isolated pages causes free page accounting issues. The free
> pages will be split and freed into separate migratetype lists; the
> in-use pages will be migrated then the freed pages will be handled in
> the aforementioned way.
>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/page-isolation.h |   4 +-
>  mm/internal.h                  |   6 ++
>  mm/memory_hotplug.c            |   3 +-
>  mm/page_alloc.c                |  54 ++++++++--
>  mm/page_isolation.c            | 184 ++++++++++++++++++++++++++++++++-
>  5 files changed, 233 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index e14eddf6741a..5456b7be38ae 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
>   */
>  int
>  start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			 unsigned migratetype, int flags);
> +			 int migratetype, int flags, gfp_t gfp_flags);
>
>  /*
>   * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
> @@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>   */
>  void
>  undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			unsigned migratetype);
> +			int migratetype);
>
>  /*
>   * Test all pages in [start_pfn, end_pfn) are isolated or not.
> diff --git a/mm/internal.h b/mm/internal.h
> index 919fa07e1031..0667abd57634 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>  			  phys_addr_t min_addr,
>  			  int nid, bool exact_nid);
>
> +void split_free_page(struct page *free_page,
> +				int order, unsigned long split_pfn_offset);
> +
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
>  /*
> @@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
>  int
>  isolate_migratepages_range(struct compact_control *cc,
>  			   unsigned long low_pfn, unsigned long end_pfn);
> +
> +int __alloc_contig_migrate_range(struct compact_control *cc,
> +					unsigned long start, unsigned long end);
>  #endif
>  int find_suitable_fallback(struct free_area *area, unsigned int order,
>  			int migratetype, bool only_stealable, bool *can_steal);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4c6065e5d274..9f8ae4cb77ee 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1845,7 +1845,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
> -				       MEMORY_OFFLINE | REPORT_FAILURE);
> +				       MEMORY_OFFLINE | REPORT_FAILURE,
> +				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
>  	if (ret) {
>  		reason = "failure to isolate range";
>  		goto failed_removal_pcplists_disabled;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ce23ac8ad085..70ddd9a0bcf3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1094,6 +1094,43 @@ static inline void __free_one_page(struct page *page,
>  		page_reporting_notify_free(order);
>  }
>
> +/**
> + * split_free_page() -- split a free page at split_pfn_offset
> + * @free_page:		the original free page
> + * @order:		the order of the page
> + * @split_pfn_offset:	split offset within the page
> + *
> + * It is used when the free page crosses two pageblocks with different migratetypes
> + * at split_pfn_offset within the page. The split free page will be put into
> + * separate migratetype lists afterwards. Otherwise, the function achieves
> + * nothing.
> + */
> +void split_free_page(struct page *free_page,
> +				int order, unsigned long split_pfn_offset)
> +{
> +	struct zone *zone = page_zone(free_page);
> +	unsigned long free_page_pfn = page_to_pfn(free_page);
> +	unsigned long pfn;
> +	unsigned long flags;
> +	int free_page_order;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	del_page_from_free_list(free_page, zone, order);
> +	for (pfn = free_page_pfn;
> +	     pfn < free_page_pfn + (1UL << order);) {
> +		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> +
> +		free_page_order = ffs(split_pfn_offset) - 1;
> +		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
> +				mt, FPI_NONE);
> +		pfn += 1UL << free_page_order;
> +		split_pfn_offset -= (1UL << free_page_order);
> +		/* we have done the first part, now switch to second part */
> +		if (split_pfn_offset == 0)
> +			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
>  /*
>   * A bad page could be due to a number of fields. Instead of multiple branches,
>   * try and check multiple fields with one check. The caller must do a detailed
> @@ -8919,7 +8956,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
>  #endif
>
>  /* [start, end) must belong to a single zone. */
> -static int __alloc_contig_migrate_range(struct compact_control *cc,
> +int __alloc_contig_migrate_range(struct compact_control *cc,
>  					unsigned long start, unsigned long end)
>  {
>  	/* This function is based on compact_zone() from compaction.c. */
> @@ -9002,7 +9039,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		       unsigned migratetype, gfp_t gfp_mask)
>  {
>  	unsigned long outer_start, outer_end;
> -	unsigned int order;
> +	int order;
>  	int ret = 0;
>
>  	struct compact_control cc = {
> @@ -9021,14 +9058,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * What we do here is we mark all pageblocks in range as
>  	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
>  	 * have different sizes, and due to the way page allocator
> -	 * work, we align the range to biggest of the two pages so
> -	 * that page allocator won't try to merge buddies from
> -	 * different pageblocks and change MIGRATE_ISOLATE to some
> -	 * other migration type.
> +	 * work, start_isolate_page_range() has special handlings for this.
>  	 *
>  	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
>  	 * migrate the pages from an unaligned range (ie. pages that
> -	 * we are interested in).  This will put all the pages in
> +	 * we are interested in). This will put all the pages in
>  	 * range back to page allocator as MIGRATE_ISOLATE.
>  	 *
>  	 * When this is done, we take the pages in range from page
> @@ -9042,9 +9076,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 */
>
>  	ret = start_isolate_page_range(pfn_max_align_down(start),
> -				       pfn_max_align_up(end), migratetype, 0);
> +				pfn_max_align_up(end), migratetype, 0, gfp_mask);
>  	if (ret)
> -		return ret;
> +		goto done;
>
>  	drain_all_pages(cc.zone);
>
> @@ -9064,7 +9098,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	ret = 0;
>
>  	/*
> -	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
> +	 * Pages from [start, end) are within a pageblock_nr_pages
>  	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
>  	 * more, all pages in [start, end) are free in page allocator.
>  	 * What we are going to do is to allocate all pages from
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c2f7a8bb634d..94b3467e5ba2 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  	return -EBUSY;
>  }
>
> -static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> +static void unset_migratetype_isolate(struct page *page, int migratetype)
>  {
>  	struct zone *zone;
>  	unsigned long flags, nr_pages;
> @@ -279,6 +279,157 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>  	return NULL;
>  }
>
> +/**
> + * isolate_single_pageblock() -- tries to isolate a pageblock that might be
> + * within a free or in-use page.
> + * @boundary_pfn:		pageblock-aligned pfn that a page might cross
> + * @gfp_flags:			GFP flags used for migrating pages
> + * @isolate_before:	isolate the pageblock before the boundary_pfn
> + *
> + * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
> + * pageblock. When not all pageblocks within a page are isolated at the same
> + * time, free page accounting can go wrong. For example, in the case of
> + * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pagelbocks.
> + * [         MAX_ORDER-1         ]
> + * [  pageblock0  |  pageblock1  ]
> + * When either pageblock is isolated, if it is a free page, the page is not
> + * split into separate migratetype lists, which is supposed to; if it is an
> + * in-use page and freed later, __free_one_page() does not split the free page
> + * either. The function handles this by splitting the free page or migrating
> + * the in-use page then splitting the free page.
> + */
> +static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
> +			bool isolate_before)
> +{
> +	unsigned char saved_mt;
> +	unsigned long start_pfn;
> +	unsigned long isolate_pageblock;
> +	unsigned long pfn;
> +	struct zone *zone;
> +
> +	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
> +
> +	if (isolate_before)
> +		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
> +	else
> +		isolate_pageblock = boundary_pfn;
> +
> +	/*
> +	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
> +	 * only isolating a subset of pageblocks from a bigger than pageblock
> +	 * free or in-use page. Also make sure all to-be-isolated pageblocks
> +	 * are within the same zone.
> +	 */
> +	zone  = page_zone(pfn_to_page(isolate_pageblock));
> +	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
> +				      zone->zone_start_pfn);
> +
> +	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
> +
> +	/*
> +	 * Bail out early when the to-be-isolated pageblock does not form
> +	 * a free or in-use page across boundary_pfn:
> +	 *
> +	 * 1. isolate before boundary_pfn: the page after is not online
> +	 * 2. isolate after boundary_pfn: the page before is not online
> +	 *
> +	 * This also ensures correctness. Without it, when isolate after
> +	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
> +	 * __first_valid_page() will return unexpected NULL in the for loop
> +	 * below.
> +	 */
> +	if (isolate_before) {
> +		if (!pfn_to_online_page(boundary_pfn))
> +			return 0;
> +	} else {
> +		if (!pfn_to_online_page(boundary_pfn - 1))
> +			return 0;
> +	}
> +
> +	for (pfn = start_pfn; pfn < boundary_pfn;) {
> +		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
> +
> +		VM_BUG_ON(!page);
> +		pfn = page_to_pfn(page);
> +		/*
> +		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
> +		 * free pages in [start_pfn, boundary_pfn), its head page will
> +		 * always be in the range.
> +		 */
> +		if (PageBuddy(page)) {
> +			int order = buddy_order(page);
> +
> +			if (pfn + (1UL << order) > boundary_pfn)
> +				split_free_page(page, order, boundary_pfn - pfn);
> +			pfn += (1UL << order);
> +			continue;
> +		}
> +		/*
> +		 * migrate compound pages then let the free page handling code
> +		 * above do the rest. If migration is not enabled, just fail.
> +		 */
> +		if (PageHuge(page) || PageTransCompound(page)) {
> +#if defined CONFIG_COMPACTION || defined CONFIG_CMA
> +			unsigned long nr_pages = compound_nr(page);
> +			int order = compound_order(page);
> +			struct page *head = compound_head(page);
> +			unsigned long head_pfn = page_to_pfn(head);
> +			int ret;
> +			struct compact_control cc = {
> +				.nr_migratepages = 0,
> +				.order = -1,
> +				.zone = page_zone(pfn_to_page(head_pfn)),
> +				.mode = MIGRATE_SYNC,
> +				.ignore_skip_hint = true,
> +				.no_set_skip_hint = true,
> +				.gfp_mask = gfp_flags,
> +				.alloc_contig = true,
> +			};
> +			INIT_LIST_HEAD(&cc.migratepages);
> +
> +			if (head_pfn + nr_pages < boundary_pfn) {
> +				pfn += nr_pages;
> +				continue;
> +			}
> +
> +			ret = __alloc_contig_migrate_range(&cc, head_pfn,
> +						head_pfn + nr_pages);
> +
> +			if (ret)
> +				goto failed;
> +			/*
> +			 * reset pfn, let the free page handling code above
> +			 * split the free page to the right migratetype list.
> +			 *
> +			 * head_pfn is not used here as a hugetlb page order
> +			 * can be bigger than MAX_ORDER-1, but after it is
> +			 * freed, the free page order is not. Use pfn within
> +			 * the range to find the head of the free page and
> +			 * reset order to 0 if a hugetlb page with
> +			 * >MAX_ORDER-1 order is encountered.
> +			 */
> +			if (order > MAX_ORDER-1)
> +				order = 0;
> +			while (!PageBuddy(pfn_to_page(pfn))) {
> +				order++;
> +				pfn &= ~0UL << order;
> +			}
> +			continue;
> +#else
> +			goto failed;
> +#endif
> +		}
> +
> +		pfn++;
> +	}
> +	return 0;
> +failed:
> +	/* restore the original migratetype */
> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
> +	return -EBUSY;
> +}
> +
>  /**
>   * start_isolate_page_range() - make page-allocation-type of range of pages to
>   * be MIGRATE_ISOLATE.
> @@ -293,6 +444,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   *					 and PageOffline() pages.
>   *			REPORT_FAILURE - report details about the failure to
>   *			isolate the range
> + * @gfp_flags:		GFP flags used for migrating pages that sit across the
> + *			range boundaries.
>   *
>   * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
>   * the range will never be allocated. Any free pages and pages freed in the
> @@ -301,6 +454,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * pages in the range finally, the caller have to free all pages in the range.
>   * test_page_isolated() can be used for test it.
>   *
> + * The function first tries to isolate the pageblocks at the beginning and end
> + * of the range, since there might be pages across the range boundaries.
> + * Afterwards, it isolates the rest of the range.
> + *
>   * There is no high level synchronization mechanism that prevents two threads
>   * from trying to isolate overlapping ranges. If this happens, one thread
>   * will notice pageblocks in the overlapping range already set to isolate.
> @@ -321,21 +478,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
>  int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			     unsigned migratetype, int flags)
> +			     int migratetype, int flags, gfp_t gfp_flags)
>  {
>  	unsigned long pfn;
>  	struct page *page;
> +	int ret;
>
>  	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
>  	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
>
> -	for (pfn = start_pfn;
> -	     pfn < end_pfn;
> +	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
> +	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
> +	if (ret)
> +		return ret;
> +
> +	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
> +	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
> +	if (ret) {
> +		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
> +		return ret;
> +	}
> +
> +	/* skip isolated pageblocks at the beginning and end */
> +	for (pfn = start_pfn + pageblock_nr_pages;
> +	     pfn < end_pfn - pageblock_nr_pages;
>  	     pfn += pageblock_nr_pages) {
>  		page = __first_valid_page(pfn, pageblock_nr_pages);
>  		if (page && set_migratetype_isolate(page, migratetype, flags,
>  					start_pfn, end_pfn)) {
>  			undo_isolate_page_range(start_pfn, pfn, migratetype);
> +			unset_migratetype_isolate(
> +				pfn_to_page(end_pfn - pageblock_nr_pages),
> +				migratetype);
>  			return -EBUSY;
>  		}
>  	}
> @@ -346,7 +520,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>   * Make isolated pages available again.
>   */
>  void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			    unsigned migratetype)
> +			    int migratetype)
>  {
>  	unsigned long pfn;
>  	struct page *page;
> -- 
> 2.35.1

Qian hit a bug caused by this series https://lore.kernel.org/linux-mm/20220426201855.GA1014@qian/
and the fix is:

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 75e454f5cf45..b3f074d1682e 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -367,58 +367,67 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 		}
 		/*
 		 * migrate compound pages then let the free page handling code
-		 * above do the rest. If migration is not enabled, just fail.
+		 * above do the rest. If migration is not possible, just fail.
 		 */
-		if (PageHuge(page) || PageTransCompound(page)) {
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+		if (PageCompound(page)) {
 			unsigned long nr_pages = compound_nr(page);
-			int order = compound_order(page);
 			struct page *head = compound_head(page);
 			unsigned long head_pfn = page_to_pfn(head);
-			int ret;
-			struct compact_control cc = {
-				.nr_migratepages = 0,
-				.order = -1,
-				.zone = page_zone(pfn_to_page(head_pfn)),
-				.mode = MIGRATE_SYNC,
-				.ignore_skip_hint = true,
-				.no_set_skip_hint = true,
-				.gfp_mask = gfp_flags,
-				.alloc_contig = true,
-			};
-			INIT_LIST_HEAD(&cc.migratepages);

 			if (head_pfn + nr_pages < boundary_pfn) {
-				pfn += nr_pages;
+				pfn = head_pfn + nr_pages;
 				continue;
 			}
-
-			ret = __alloc_contig_migrate_range(&cc, head_pfn,
-						head_pfn + nr_pages);
-
-			if (ret)
-				goto failed;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
 			/*
-			 * reset pfn, let the free page handling code above
-			 * split the free page to the right migratetype list.
-			 *
-			 * head_pfn is not used here as a hugetlb page order
-			 * can be bigger than MAX_ORDER-1, but after it is
-			 * freed, the free page order is not. Use pfn within
-			 * the range to find the head of the free page and
-			 * reset order to 0 if a hugetlb page with
-			 * >MAX_ORDER-1 order is encountered.
+			 * hugetlb, lru compound (THP), and movable compound pages
+			 * can be migrated. Otherwise, fail the isolation.
 			 */
-			if (order > MAX_ORDER-1)
+			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+				int order;
+				unsigned long outer_pfn;
+				int ret;
+				struct compact_control cc = {
+					.nr_migratepages = 0,
+					.order = -1,
+					.zone = page_zone(pfn_to_page(head_pfn)),
+					.mode = MIGRATE_SYNC,
+					.ignore_skip_hint = true,
+					.no_set_skip_hint = true,
+					.gfp_mask = gfp_flags,
+					.alloc_contig = true,
+				};
+				INIT_LIST_HEAD(&cc.migratepages);
+
+				ret = __alloc_contig_migrate_range(&cc, head_pfn,
+							head_pfn + nr_pages);
+
+				if (ret)
+					goto failed;
+				/*
+				 * reset pfn to the head of the free page, so
+				 * that the free page handling code above can split
+				 * the free page to the right migratetype list.
+				 *
+				 * head_pfn is not used here as a hugetlb page order
+				 * can be bigger than MAX_ORDER-1, but after it is
+				 * freed, the free page order is not. Use pfn within
+				 * the range to find the head of the free page.
+				 */
 				order = 0;
-			while (!PageBuddy(pfn_to_page(pfn))) {
-				order++;
-				pfn &= ~0UL << order;
-			}
-			continue;
-#else
-			goto failed;
+				outer_pfn = pfn;
+				while (!PageBuddy(pfn_to_page(outer_pfn))) {
+					if (++order >= MAX_ORDER) {
+						outer_pfn = pfn;
+						break;
+					}
+					outer_pfn &= ~0UL << order;
+				}
+				pfn = outer_pfn;
+				continue;
+			} else
 #endif
+				goto failed;
 		}

 		pfn++;
-- 
2.35.1




The fixed-up patch is below for easier review:

From fce466e89e50bcb0ebb56d7809db1b8bbea47628 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Tue, 26 Apr 2022 23:00:33 -0400
Subject: [PATCH] mm: make alloc_contig_range work at pageblock granularity

alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
merging pageblocks with different migratetypes.  It might unnecessarily
convert extra pageblocks at the beginning and at the end of the range.
Change alloc_contig_range() to work at pageblock granularity.

Special handling is needed for free pages and in-use pages across the
boundaries of the range specified by alloc_contig_range(), because these
partially isolated pages cause free page accounting issues.  The free
pages will be split and freed into separate migratetype lists; the in-use
pages will be migrated, and the pages freed by the migration will then be
handled in the same way.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
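A minimal user-space sketch (not kernel code) of the split arithmetic used by
split_free_page() below, assuming MAX_ORDER-1 = 10, pageblock_order = 9, and a
free order-10 page at pfn 0x80000 whose second pageblock starts at the
isolation boundary; ffs() from <strings.h> stands in for the kernel helper:

#include <stdio.h>
#include <strings.h>

int main(void)
{
	unsigned long free_page_pfn = 0x80000, pfn;
	int order = 10;				/* a MAX_ORDER-1 free page */
	unsigned long split_pfn_offset = 512;	/* boundary_pfn - free_page_pfn */
	int free_page_order;

	for (pfn = free_page_pfn; pfn < free_page_pfn + (1UL << order);) {
		free_page_order = ffs((int)split_pfn_offset) - 1;
		printf("free chunk at pfn %#lx, order %d\n", pfn, free_page_order);
		pfn += 1UL << free_page_order;
		split_pfn_offset -= 1UL << free_page_order;
		/* first part done, switch to the remainder of the page */
		if (split_pfn_offset == 0)
			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
	}
	return 0;
}

It prints two order-9 chunks, one per pageblock, so the chunk inside the
isolated pageblock can be freed onto the MIGRATE_ISOLATE free list while the
other chunk keeps its original migratetype.
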
 include/linux/page-isolation.h |   4 +-
 mm/internal.h                  |   6 +
 mm/memory_hotplug.c            |   3 +-
 mm/page_alloc.c                |  54 +++++++--
 mm/page_isolation.c            | 193 ++++++++++++++++++++++++++++++++-
 5 files changed, 242 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index e14eddf6741a..5456b7be38ae 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
  */
 int
 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			 unsigned migratetype, int flags);
+			 int migratetype, int flags, gfp_t gfp_flags);

 /*
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
@@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
  */
 void
 undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			unsigned migratetype);
+			int migratetype);

 /*
  * Test all pages in [start_pfn, end_pfn) are isolated or not.
diff --git a/mm/internal.h b/mm/internal.h
index 919fa07e1031..0667abd57634 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
 			  phys_addr_t min_addr,
 			  int nid, bool exact_nid);

+void split_free_page(struct page *free_page,
+				int order, unsigned long split_pfn_offset);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA

 /*
@@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
 int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
+
+int __alloc_contig_migrate_range(struct compact_control *cc,
+					unsigned long start, unsigned long end);
 #endif
 int find_suitable_fallback(struct free_area *area, unsigned int order,
 			int migratetype, bool only_stealable, bool *can_steal);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4c6065e5d274..9f8ae4cb77ee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1845,7 +1845,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
-				       MEMORY_OFFLINE | REPORT_FAILURE);
+				       MEMORY_OFFLINE | REPORT_FAILURE,
+				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
 	if (ret) {
 		reason = "failure to isolate range";
 		goto failed_removal_pcplists_disabled;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 93dbe05a6029..6a0d1746c095 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1094,6 +1094,43 @@ static inline void __free_one_page(struct page *page,
 		page_reporting_notify_free(order);
 }

+/**
+ * split_free_page() -- split a free page at split_pfn_offset
+ * @free_page:		the original free page
+ * @order:		the order of the page
+ * @split_pfn_offset:	split offset within the page
+ *
+ * It is used when the free page crosses, at split_pfn_offset within the page,
+ * a pageblock boundary between two pageblocks with different migratetypes.
+ * The split parts of the free page will be put onto separate migratetype
+ * lists afterwards. Otherwise, the function achieves nothing.
+ */
+void split_free_page(struct page *free_page,
+				int order, unsigned long split_pfn_offset)
+{
+	struct zone *zone = page_zone(free_page);
+	unsigned long free_page_pfn = page_to_pfn(free_page);
+	unsigned long pfn;
+	unsigned long flags;
+	int free_page_order;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	del_page_from_free_list(free_page, zone, order);
+	for (pfn = free_page_pfn;
+	     pfn < free_page_pfn + (1UL << order);) {
+		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
+
+		free_page_order = ffs(split_pfn_offset) - 1;
+		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
+				mt, FPI_NONE);
+		pfn += 1UL << free_page_order;
+		split_pfn_offset -= (1UL << free_page_order);
+		/* we have done the first part, now switch to second part */
+		if (split_pfn_offset == 0)
+			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -8919,7 +8956,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
 #endif

 /* [start, end) must belong to a single zone. */
-static int __alloc_contig_migrate_range(struct compact_control *cc,
+int __alloc_contig_migrate_range(struct compact_control *cc,
 					unsigned long start, unsigned long end)
 {
 	/* This function is based on compact_zone() from compaction.c. */
@@ -9002,7 +9039,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	unsigned int order;
+	int order;
 	int ret = 0;

 	struct compact_control cc = {
@@ -9021,14 +9058,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * What we do here is we mark all pageblocks in range as
 	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
 	 * have different sizes, and due to the way page allocator
-	 * work, we align the range to biggest of the two pages so
-	 * that page allocator won't try to merge buddies from
-	 * different pageblocks and change MIGRATE_ISOLATE to some
-	 * other migration type.
+	 * work, start_isolate_page_range() has special handling for this.
 	 *
 	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
 	 * migrate the pages from an unaligned range (ie. pages that
-	 * we are interested in).  This will put all the pages in
+	 * we are interested in). This will put all the pages in
 	 * range back to page allocator as MIGRATE_ISOLATE.
 	 *
 	 * When this is done, we take the pages in range from page
@@ -9042,9 +9076,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */

 	ret = start_isolate_page_range(pfn_max_align_down(start),
-				       pfn_max_align_up(end), migratetype, 0);
+				pfn_max_align_up(end), migratetype, 0, gfp_mask);
 	if (ret)
-		return ret;
+		goto done;

 	drain_all_pages(cc.zone);

@@ -9064,7 +9098,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	ret = 0;

 	/*
-	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
+	 * Pages from [start, end) are within a pageblock_nr_pages
 	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
 	 * more, all pages in [start, end) are free in page allocator.
 	 * What we are going to do is to allocate all pages from
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c2f7a8bb634d..8a0f16d2e4c3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	return -EBUSY;
 }

-static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
+static void unset_migratetype_isolate(struct page *page, int migratetype)
 {
 	struct zone *zone;
 	unsigned long flags, nr_pages;
@@ -279,6 +279,166 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 	return NULL;
 }

+/**
+ * isolate_single_pageblock() -- tries to isolate a pageblock that might be
+ * within a free or in-use page.
+ * @boundary_pfn:		pageblock-aligned pfn that a page might cross
+ * @gfp_flags:			GFP flags used for migrating pages
+ * @isolate_before:	isolate the pageblock before the boundary_pfn
+ *
+ * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
+ * pageblock. When not all pageblocks within a page are isolated at the same
+ * time, free page accounting can go wrong. For example, in the case of
+ * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pageblocks.
+ * [         MAX_ORDER-1         ]
+ * [  pageblock0  |  pageblock1  ]
+ * When either pageblock is isolated, if it is a free page, the page is not
+ * split onto separate migratetype lists as it should be; if it is an
+ * in-use page and freed later, __free_one_page() does not split the free page
+ * either. The function handles this by splitting the free page or migrating
+ * the in-use page then splitting the free page.
+ */
+static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
+			bool isolate_before)
+{
+	unsigned char saved_mt;
+	unsigned long start_pfn;
+	unsigned long isolate_pageblock;
+	unsigned long pfn;
+	struct zone *zone;
+
+	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
+
+	if (isolate_before)
+		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
+	else
+		isolate_pageblock = boundary_pfn;
+
+	/*
+	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
+	 * only isolating a subset of pageblocks from a bigger than pageblock
+	 * free or in-use page. Also make sure all to-be-isolated pageblocks
+	 * are within the same zone.
+	 */
+	zone  = page_zone(pfn_to_page(isolate_pageblock));
+	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
+				      zone->zone_start_pfn);
+
+	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
+	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
+
+	/*
+	 * Bail out early when the to-be-isolated pageblock does not form
+	 * a free or in-use page across boundary_pfn:
+	 *
+	 * 1. isolate before boundary_pfn: the page after is not online
+	 * 2. isolate after boundary_pfn: the page before is not online
+	 *
+	 * This also ensures correctness. Without it, when isolate after
+	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
+	 * __first_valid_page() will return unexpected NULL in the for loop
+	 * below.
+	 */
+	if (isolate_before) {
+		if (!pfn_to_online_page(boundary_pfn))
+			return 0;
+	} else {
+		if (!pfn_to_online_page(boundary_pfn - 1))
+			return 0;
+	}
+
+	for (pfn = start_pfn; pfn < boundary_pfn;) {
+		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
+
+		VM_BUG_ON(!page);
+		pfn = page_to_pfn(page);
+		/*
+		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
+		 * free pages in [start_pfn, boundary_pfn), its head page will
+		 * always be in the range.
+		 */
+		if (PageBuddy(page)) {
+			int order = buddy_order(page);
+
+			if (pfn + (1UL << order) > boundary_pfn)
+				split_free_page(page, order, boundary_pfn - pfn);
+			pfn += (1UL << order);
+			continue;
+		}
+		/*
+		 * migrate compound pages then let the free page handling code
+		 * above do the rest. If migration is not possible, just fail.
+		 */
+		if (PageCompound(page)) {
+			unsigned long nr_pages = compound_nr(page);
+			struct page *head = compound_head(page);
+			unsigned long head_pfn = page_to_pfn(head);
+
+			if (head_pfn + nr_pages < boundary_pfn) {
+				pfn = head_pfn + nr_pages;
+				continue;
+			}
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+			/*
+			 * hugetlb, lru compound (THP), and movable compound pages
+			 * can be migrated. Otherwise, fail the isolation.
+			 */
+			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+				int order;
+				unsigned long outer_pfn;
+				int ret;
+				struct compact_control cc = {
+					.nr_migratepages = 0,
+					.order = -1,
+					.zone = page_zone(pfn_to_page(head_pfn)),
+					.mode = MIGRATE_SYNC,
+					.ignore_skip_hint = true,
+					.no_set_skip_hint = true,
+					.gfp_mask = gfp_flags,
+					.alloc_contig = true,
+				};
+				INIT_LIST_HEAD(&cc.migratepages);
+
+				ret = __alloc_contig_migrate_range(&cc, head_pfn,
+							head_pfn + nr_pages);
+
+				if (ret)
+					goto failed;
+				/*
+				 * reset pfn to the head of the free page, so
+				 * that the free page handling code above can split
+				 * the free page to the right migratetype list.
+				 *
+				 * head_pfn is not used here as a hugetlb page order
+				 * can be bigger than MAX_ORDER-1, but after it is
+				 * freed, the free page order is not. Use pfn within
+				 * the range to find the head of the free page.
+				 */
+				order = 0;
+				outer_pfn = pfn;
+				while (!PageBuddy(pfn_to_page(outer_pfn))) {
+					if (++order >= MAX_ORDER) {
+						outer_pfn = pfn;
+						break;
+					}
+					outer_pfn &= ~0UL << order;
+				}
+				pfn = outer_pfn;
+				continue;
+			} else
+#endif
+				goto failed;
+		}
+
+		pfn++;
+	}
+	return 0;
+failed:
+	/* restore the original migratetype */
+	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
+	return -EBUSY;
+}
+
 /**
  * start_isolate_page_range() - make page-allocation-type of range of pages to
  * be MIGRATE_ISOLATE.
@@ -293,6 +453,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  *					 and PageOffline() pages.
  *			REPORT_FAILURE - report details about the failure to
  *			isolate the range
+ * @gfp_flags:		GFP flags used for migrating pages that sit across the
+ *			range boundaries.
  *
  * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
  * the range will never be allocated. Any free pages and pages freed in the
@@ -301,6 +463,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * pages in the range finally, the caller have to free all pages in the range.
  * test_page_isolated() can be used for test it.
  *
+ * The function first tries to isolate the pageblocks at the beginning and end
+ * of the range, since there might be pages across the range boundaries.
+ * Afterwards, it isolates the rest of the range.
+ *
  * There is no high level synchronization mechanism that prevents two threads
  * from trying to isolate overlapping ranges. If this happens, one thread
  * will notice pageblocks in the overlapping range already set to isolate.
@@ -321,21 +487,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			     unsigned migratetype, int flags)
+			     int migratetype, int flags, gfp_t gfp_flags)
 {
 	unsigned long pfn;
 	struct page *page;
+	int ret;

 	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
 	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));

-	for (pfn = start_pfn;
-	     pfn < end_pfn;
+	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
+	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
+	if (ret)
+		return ret;
+
+	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
+	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
+	if (ret) {
+		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
+		return ret;
+	}
+
+	/* skip isolated pageblocks at the beginning and end */
+	for (pfn = start_pfn + pageblock_nr_pages;
+	     pfn < end_pfn - pageblock_nr_pages;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
 			undo_isolate_page_range(start_pfn, pfn, migratetype);
+			unset_migratetype_isolate(
+				pfn_to_page(end_pfn - pageblock_nr_pages),
+				migratetype);
 			return -EBUSY;
 		}
 	}
@@ -346,7 +529,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
  * Make isolated pages available again.
  */
 void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			    unsigned migratetype)
+			    int migratetype)
 {
 	unsigned long pfn;
 	struct page *page;
-- 
2.35.1

--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
@ 2022-05-10  1:03   ` Andrew Morton
  2022-04-25 14:31 ` [PATCH v11 2/6] mm: page_isolation: check specified range for unmovable pages Zi Yan
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2022-05-10  1:03 UTC (permalink / raw)
  To: Zi Yan
  Cc: Zi Yan, David Hildenbrand, linux-mm, linux-kernel,
	virtualization, Vlastimil Babka, Mel Gorman, Eric Ren,
	Mike Rapoport, Oscar Salvador, Christophe Leroy

On Mon, 25 Apr 2022 10:31:12 -0400 Zi Yan <zi.yan@sent.com> wrote:

> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1].

I'm thinking this looks ready to be merged into mm-stable later this week, for
the 5.19-rc1 merge window.

I believe the build error at
https://lkml.kernel.org/r/CA+G9fYveMF-NU-rvrsbaora2g2QWxrkF7AWViuDrJyN9mNScJg@mail.gmail.com
was addressed in ARM?

I have one -fix to be squashed,
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-make-alloc_contig_range-work-at-pageblock-granularity-fix.patch



* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-10  1:03   ` Andrew Morton
  (?)
@ 2022-05-10  1:07   ` Zi Yan
  -1 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-10  1:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy


On 9 May 2022, at 21:03, Andrew Morton wrote:

> On Mon, 25 Apr 2022 10:31:12 -0400 Zi Yan <zi.yan@sent.com> wrote:
>
>> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
>> and alloc_contig_range(). It prepares for my upcoming changes to make
>> MAX_ORDER adjustable at boot time[1].
>
> I'm thinking this looks ready to be merged into mm-stable later this week, for
> the 5.19-rc1 merge window.
>
> I believe the build error at
> https://lkml.kernel.org/r/CA+G9fYveMF-NU-rvrsbaora2g2QWxrkF7AWViuDrJyN9mNScJg@mail.gmail.com
> was addressed in ARM?

Right. The warning is caused by CONFIG_ARM64_SME=y, not this patchset;
see https://lore.kernel.org/all/YnGrbEt3oBBTly7u@qian/.

>
> I have one -fix to be squashed,
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-make-alloc_contig_range-work-at-pageblock-granularity-fix.patch

Yes. Thanks.

--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-04-28 12:39                 ` Zi Yan
  2022-04-28 16:19                   ` Qian Cai
@ 2022-05-19 20:57                   ` Qian Cai
  2022-05-19 21:35                     ` Zi Yan
  1 sibling, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-05-19 20:57 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
> How about the one attached? I can apply it to next-20220428. Let me know
> if you are using a different branch. Thanks.

Zi, it turns out that the endless loop in isolate_single_pageblock() can
still be reproduced on today's linux-next tree by running the reproducer a
few times. With this debug patch applied, it keeps printing the same
values.

--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -399,6 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                                };
                                INIT_LIST_HEAD(&cc.migratepages);

+                               printk_ratelimited("KK stucked pfn=%lu head_pfn=%lu nr_pages=%lu boundary_pfn=%lu\n", pfn, head_pfn, nr_pages, boundary_pfn);
                                ret = __alloc_contig_migrate_range(&cc, head_pfn,
                                                        head_pfn + nr_pages);

 isolate_single_pageblock: 179 callbacks suppressed
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-19 20:57                   ` Qian Cai
@ 2022-05-19 21:35                     ` Zi Yan
  2022-05-19 23:24                       ` Zi Yan
  2022-05-20 11:30                       ` Qian Cai
  0 siblings, 2 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-19 21:35 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton


On 19 May 2022, at 16:57, Qian Cai wrote:

> On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
>> How about the one attached? I can apply it to next-20220428. Let me know
>> if you are using a different branch. Thanks.
>
> Zi, it turns out that the endless loop in isolate_single_pageblock() can
> still be reproduced on today's linux-next tree by running the reproducer a
> few times. With this debug patch applied, it keeps printing the same
> values.
>
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -399,6 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>                                 };
>                                 INIT_LIST_HEAD(&cc.migratepages);
>
> +                               printk_ratelimited("KK stucked pfn=%lu head_pfn=%lu nr_pages=%lu boundary_pfn=%lu\n", pfn, head_pfn, nr_pages, boundary_pfn);
>                                 ret = __alloc_contig_migrate_range(&cc, head_pfn,
>                                                         head_pfn + nr_pages);
>
>  isolate_single_pageblock: 179 callbacks suppressed
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896

Hi Qian,

Thanks for your testing.

Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
page caused the infinite loop, because the page was not migrated and the code kept
retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
code that the page cannot be migrated, so the code can goto failed without retrying. It would be
great if you could share exactly what was run after boot, so that I can reproduce locally to
identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.

Can you also try the patch below to see if it fixes the infinite loop?

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b3f074d1682e..abde1877bbcb 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                                order = 0;
                                outer_pfn = pfn;
                                while (!PageBuddy(pfn_to_page(outer_pfn))) {
-                                       if (++order >= MAX_ORDER) {
-                                               outer_pfn = pfn;
-                                               break;
-                                       }
+                                       /* abort if the free page cannot be found */
+                                       if (++order >= MAX_ORDER)
+                                               goto failed;
                                        outer_pfn &= ~0UL << order;
                                }
                                pfn = outer_pfn;

--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-19 21:35                     ` Zi Yan
@ 2022-05-19 23:24                       ` Zi Yan
  2022-05-20 11:30                       ` Qian Cai
  1 sibling, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-19 23:24 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton


On 19 May 2022, at 17:35, Zi Yan wrote:

> On 19 May 2022, at 16:57, Qian Cai wrote:
>
>> On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
>>> How about the one attached? I can apply it to next-20220428. Let me know
>>> if you are using a different branch. Thanks.
>>
>> Zi, it turns out that the endless loop in isolate_single_pageblock() can
>> still be reproduced on today's linux-next tree by running the reproducer a
>> few times. With this debug patch applied, it keeps printing the same
>> values.
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -399,6 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>                                 };
>>                                 INIT_LIST_HEAD(&cc.migratepages);
>>
>> +                               printk_ratelimited("KK stucked pfn=%lu head_pfn=%lu nr_pages=%lu boundary_pfn=%lu\n", pfn, head_pfn, nr_pages, boundary_pfn);
>>                                 ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>                                                         head_pfn + nr_pages);
>>
>>  isolate_single_pageblock: 179 callbacks suppressed
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>
> Hi Qian,
>
> Thanks for your testing.
>
> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
> page caused the infinite loop, because the page was not migrated and the code kept
> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
> code the page cannot be migrated and the code will goto failed without retrying. It will be
> great you can share what exactly has run after boot, so that I can reproduce locally to
> identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.
>
> Can you also try the patch below to see if it fixes the infinite loop?

I also have an off-by-one error in the code. The error caused unnecessary attempts to
migrate some pages, and your endless loop case seems to be caused by it.
Can you try the patch below instead? Thanks.

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b3f074d1682e..5c8099bb822f 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -374,7 +374,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                        struct page *head = compound_head(page);
                        unsigned long head_pfn = page_to_pfn(head);

-                       if (head_pfn + nr_pages < boundary_pfn) {
+                       if (head_pfn + nr_pages <= boundary_pfn) {
                                pfn = head_pfn + nr_pages;
                                continue;
                        }
@@ -417,10 +417,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                                order = 0;
                                outer_pfn = pfn;
                                while (!PageBuddy(pfn_to_page(outer_pfn))) {
-                                       if (++order >= MAX_ORDER) {
-                                               outer_pfn = pfn;
-                                               break;
-                                       }
+                                       if (++order >= MAX_ORDER)
+                                               goto failed;
                                        outer_pfn &= ~0UL << order;
                                }
                                pfn = outer_pfn;

--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-19 21:35                     ` Zi Yan
  2022-05-19 23:24                       ` Zi Yan
@ 2022-05-20 11:30                       ` Qian Cai
  2022-05-20 13:43                         ` Zi Yan
  1 sibling, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-05-20 11:30 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote:
> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
> page caused the infinite loop, because the page was not migrated and the code kept
> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
> code the page cannot be migrated and the code will goto failed without retrying. It will be
> great you can share what exactly has run after boot, so that I can reproduce locally to
> identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.

The reproducer is just running the same script I shared with you previously,
multiple times. It is still quite reproducible here, as it usually
happens within an hour.

$ for i in `seq 1 100`; do ./flip_mem.py; done

> Can you also try the patch below to see if it fixes the infinite loop?
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b3f074d1682e..abde1877bbcb 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>                                 order = 0;
>                                 outer_pfn = pfn;
>                                 while (!PageBuddy(pfn_to_page(outer_pfn))) {
> -                                       if (++order >= MAX_ORDER) {
> -                                               outer_pfn = pfn;
> -                                               break;
> -                                       }
> +                                       /* abort if the free page cannot be found */
> +                                       if (++order >= MAX_ORDER)
> +                                               goto failed;
>                                         outer_pfn &= ~0UL << order;
>                                 }
>                                 pfn = outer_pfn;
> 

Can you explain a bit why this patch is the right thing to do here? I am a
little worried about shooting in the dark. Otherwise, I'll be running
the off-by-one part over the weekend to see if that helps.


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-20 11:30                       ` Qian Cai
@ 2022-05-20 13:43                         ` Zi Yan
  2022-05-20 14:13                           ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-05-20 13:43 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton


On 20 May 2022, at 7:30, Qian Cai wrote:

> On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote:
>> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
>> page caused the infinite loop, because the page was not migrated and the code kept
>> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
>> code the page cannot be migrated and the code will goto failed without retrying. It will be
>> great you can share what exactly has run after boot, so that I can reproduce locally to
>> identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.
>
> The reproducer is just to run the same script I shared with you previously
> multiple times instead. It is still quite reproducible here as it usually
> happens within a hour.
>
> $ for i in `seq 1 100`; do ./flip_mem.py; done
>
>> Can you also try the patch below to see if it fixes the infinite loop?
>>
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index b3f074d1682e..abde1877bbcb 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>                                 order = 0;
>>                                 outer_pfn = pfn;
>>                                 while (!PageBuddy(pfn_to_page(outer_pfn))) {
>> -                                       if (++order >= MAX_ORDER) {
>> -                                               outer_pfn = pfn;
>> -                                               break;
>> -                                       }
>> +                                       /* abort if the free page cannot be found */
>> +                                       if (++order >= MAX_ORDER)
>> +                                               goto failed;
>>                                         outer_pfn &= ~0UL << order;
>>                                 }
>>                                 pfn = outer_pfn;
>>
>
> Can you explain a bit how this patch is the right thing to do here? I am a
> little bit worry about shooting into the dark. Otherwise, I'll be running
> the off-by-one part over the weekend to see if that helps.

The code kept retrying to migrate a 512-page compound page, so it seems to me
that __alloc_contig_migrate_range() did not migrate the page but returned
0 every time; otherwise, if (ret) goto failed; would have bailed out of the loop
already. The original code above assumed a free page can always be found after
__alloc_contig_migrate_range(), so it retries if no free page is found.
Your infinite loop result shows that assumption is not true, so the new
code quits retrying when no free page can be found.
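
To make this concrete, here is a minimal user-space sketch (not the kernel
code) of the free-page search, i.e. the while (!PageBuddy(...)) walk in the
patch above, fed with the stuck pfn from your log, with PageBuddy() mocked as
always false and MAX_ORDER assumed to be the default 11; it prints every
candidate head the loop probes before the new code gives up:

#include <stdio.h>

#define MAX_ORDER 11

static int page_buddy(unsigned long pfn) { (void)pfn; return 0; }	/* mocked */

int main(void)
{
	unsigned long pfn = 2151120384UL, outer_pfn = pfn;
	int order = 0;

	while (!page_buddy(outer_pfn)) {
		if (++order >= MAX_ORDER) {
			printf("no free page head found -> goto failed\n");
			break;
		}
		outer_pfn &= ~0UL << order;
		printf("order %2d: probe pfn %lu\n", order, outer_pfn);
	}
	return 0;
}

Since the stuck pfn is only pageblock (order-9) aligned, the probed pfn stays
the same up to order 9, moves once at order 10, and then the search runs out
of orders, which is exactly the point where the old code reset outer_pfn and
retried forever.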

I will dig into it deeper to make sure it is the correct fix. I will
update you when I am done.

Thanks.

--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-20 13:43                         ` Zi Yan
@ 2022-05-20 14:13                           ` Zi Yan
  2022-05-20 19:41                             ` Qian Cai
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-05-20 14:13 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton


On 20 May 2022, at 9:43, Zi Yan wrote:

> On 20 May 2022, at 7:30, Qian Cai wrote:
>
>> On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote:
>>> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
>>> page caused the infinite loop, because the page was not migrated and the code kept
>>> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
>>> code the page cannot be migrated and the code will goto failed without retrying. It will be
>>> great you can share what exactly has run after boot, so that I can reproduce locally to
>>> identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.
>>
>> The reproducer is just to run the same script I shared with you previously
>> multiple times instead. It is still quite reproducible here as it usually
>> happens within a hour.
>>
>> $ for i in `seq 1 100`; do ./flip_mem.py; done

Also, do you mind providing the page dump of the 512-page compound page? I would like
to know what page caused the issue.

Thanks.

>>
>>> Can you also try the patch below to see if it fixes the infinite loop?
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index b3f074d1682e..abde1877bbcb 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>>                                 order = 0;
>>>                                 outer_pfn = pfn;
>>>                                 while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>> -                                       if (++order >= MAX_ORDER) {
>>> -                                               outer_pfn = pfn;
>>> -                                               break;
>>> -                                       }
>>> +                                       /* abort if the free page cannot be found */
>>> +                                       if (++order >= MAX_ORDER)
>>> +                                               goto failed;
>>>                                         outer_pfn &= ~0UL << order;
>>>                                 }
>>>                                 pfn = outer_pfn;
>>>
>>
>> Can you explain a bit how this patch is the right thing to do here? I am a
>> little bit worry about shooting into the dark. Otherwise, I'll be running
>> the off-by-one part over the weekend to see if that helps.
>
> The code kept retrying to migrate a 512-page compound page, so it seems to me
> that __alloc_contig_migrate_range() did not migrate the page but returned
> 0 every time, otherwise, if (ret) goto failed; would bail out of the loop
> already. The original code above assumed a free page can always be found after
> __alloc_contig_migrate_range(), so it will retry if no free page is found.
> But that assumption is not true from your infinite loop result, the new
> code quits retrying when no free page can be found.
>
> I will dig into it deeper to make sure it is the correct fix. I will
> update you when I am done.
>
> Thanks.
>
> --
> Best Regards,
> Yan, Zi


--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-20 14:13                           ` Zi Yan
@ 2022-05-20 19:41                             ` Qian Cai
  2022-05-20 21:56                               ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-05-20 19:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Fri, May 20, 2022 at 10:13:51AM -0400, Zi Yan wrote:
> Also, do you mind providing the page dump of the 512-page compound page? I would like
> to know what page caused the issue.

 page last allocated via order 9, migratetype Movable, gfp_mask 0x3c24ca(GFP_TRANSHUGE|__GFP_THISNODE), pid 831, tgid 831 (khugepaged), ts 3899865924520, free_ts 3821953009040
  post_alloc_hook
  get_page_from_freelist
  __alloc_pages
  khugepaged_alloc_page
  collapse_huge_page
  khugepaged_scan_pmd
  khugepaged_scan_mm_slot
  khugepaged
  kthread
  ret_from_fork
 page last free stack trace:
  free_pcp_prepare
  free_unref_page
  free_compound_page
  free_transhuge_page
  __put_compound_page
  release_pages
  free_pages_and_swap_cache
  tlb_batch_pages_flush
  tlb_finish_mmu
  exit_mmap
  __mmput
  mmput
  exit_mm
  do_exit
  do_group_exit
  __arm64_sys_exit_group


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-20 19:41                             ` Qian Cai
@ 2022-05-20 21:56                               ` Zi Yan
  2022-05-20 23:41                                 ` Qian Cai
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-05-20 21:56 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton


On 20 May 2022, at 15:41, Qian Cai wrote:

> On Fri, May 20, 2022 at 10:13:51AM -0400, Zi Yan wrote:
>> Also, do you mind providing the page dump of the 512-page compound page? I would like
>> to know what page caused the issue.
>
>  page last allocated via order 9, migratetype Movable, gfp_mask 0x3c24ca(GFP_TRANSHUGE|__GFP_THISNODE), pid 831, tgid 831 (khugepaged), ts 3899865924520, free_ts 3821953009040
>   post_alloc_hook
>   get_page_from_freelist
>   __alloc_pages
>   khugepaged_alloc_page
>   collapse_huge_page
>   khugepaged_scan_pmd
>   khugepaged_scan_mm_slot
>   khugepaged
>   kthread
>   ret_from_fork
>  page last free stack trace:
>   free_pcp_prepare
>   free_unref_page
>   free_compound_page
>   free_transhuge_page
>   __put_compound_page
>   release_pages
>   free_pages_and_swap_cache
>   tlb_batch_pages_flush
>   tlb_finish_mmu
>   exit_mmap
>   __mmput
>   mmput
>   exit_mm
>   do_exit
>   do_group_exit
>   __arm64_sys_exit_group

Do you have the page information like refcount, map count, mapping, index, and
page flags? That would be more helpful. Thanks.

I cannot reproduce it locally after hundreds of iterations of flip_mem.py on my
x86_64 VM and bare metal.

What ARM machine are you using? I wonder if I am able to get one locally.

Thanks.

--
Best Regards,
Yan, Zi


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-20 21:56                               ` Zi Yan
@ 2022-05-20 23:41                                 ` Qian Cai
  2022-05-22 16:54                                   ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Qian Cai @ 2022-05-20 23:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Fri, May 20, 2022 at 05:56:52PM -0400, Zi Yan wrote:
> Do you have the page information like refcount, map count, mapping, index, and
> page flags? That would be more helpful. Thanks.

page:fffffc200c7f8000 refcount:393 mapcount:1 mapping:0000000000000000 index:0xffffbb800 pfn:0x8039fe00
head:fffffc200c7f8000 order:9 compound_mapcount:0 compound_pincount:0
memcg:ffff40026005a000
anon flags: 0xbfffc000009001c(uptodate|dirty|lru|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc000009001c fffffc2007b74048 fffffc2009c087c8 ffff08038dab9189
raw: 0000000ffffbb800 0000000000000000 0000018900000000 ffff40026005a000

> I cannot reproduce it locally after hundreds of iterations of flip_mem.py on my
> x86_64 VM and bare metal.
> 
> What ARM machine are you using? I wonder if I am able to get one locally.

Ampere Altra.


* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-20 23:41                                 ` Qian Cai
@ 2022-05-22 16:54                                   ` Zi Yan
  2022-05-22 19:33                                     ` Zi Yan
  2022-05-24 16:59                                     ` Qian Cai
  0 siblings, 2 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-22 16:54 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 7649 bytes --]

On 20 May 2022, at 19:41, Qian Cai wrote:

> On Fri, May 20, 2022 at 05:56:52PM -0400, Zi Yan wrote:
>> Do you have the page information like refcount, map count, mapping, index, and
>> page flags? That would be more helpful. Thanks.
>
> page:fffffc200c7f8000 refcount:393 mapcount:1 mapping:0000000000000000 index:0xffffbb800 pfn:0x8039fe00
> head:fffffc200c7f8000 order:9 compound_mapcount:0 compound_pincount:0
> memcg:ffff40026005a000
> anon flags: 0xbfffc000009001c(uptodate|dirty|lru|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
> raw: 0bfffc000009001c fffffc2007b74048 fffffc2009c087c8 ffff08038dab9189
> raw: 0000000ffffbb800 0000000000000000 0000018900000000 ffff40026005a000
>

This is a PTE-mapped THP. Unless fewer than 393 subpages are mapped, which would mean an extra
refcount is present, the page should be migratable. Even if it is not migratable due to an extra
pin, __alloc_contig_migrate_range() will return non-zero and the code bails out.
I have no idea why it caused the infinite loop.

>> I cannot reproduce it locally after hundreds of iterations of flip_mem.py on my
>> x86_64 VM and bare metal.
>>
>> What ARM machine are you using? I wonder if I am able to get one locally.
>
> Ampere Altra.

Sorry, I have no access to such a machine right now and cannot afford to buy one.

Can you try the patch below on top of linux-next to see if it fixes the infinite loop issue?
Thanks.

1. The split_free_page() change is not strictly relevant here, but it makes the code more robust.
2. Using set_migratetype_isolate() in isolate_single_pageblock() properly marks the pageblock
MIGRATE_ISOLATE.
3. Setting the to-be-migrated page's pageblock to MIGRATE_ISOLATE avoids a possible race in which
another thread might take the free page after migration.
4. An off-by-one fix, plus no retry if the free page is not found after migration, as I added before.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4dcfa0ceca45..ad8f73b00466 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1122,13 +1122,16 @@ void split_free_page(struct page *free_page,
 	unsigned long flags;
 	int free_page_order;

+	if (split_pfn_offset == 0)
+		return;
+
 	spin_lock_irqsave(&zone->lock, flags);
 	del_page_from_free_list(free_page, zone, order);
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

-		free_page_order = ffs(split_pfn_offset) - 1;
+		free_page_order = min(pfn ? __ffs(pfn) : order, __fls(split_pfn_offset));
 		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
 				mt, FPI_NONE);
 		pfn += 1UL << free_page_order;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b3f074d1682e..706915c9a380 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -283,6 +283,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * isolate_single_pageblock() -- tries to isolate a pageblock that might be
  * within a free or in-use page.
  * @boundary_pfn:		pageblock-aligned pfn that a page might cross
+ * @flags:			isolation flags
  * @gfp_flags:			GFP flags used for migrating pages
  * @isolate_before:	isolate the pageblock before the boundary_pfn
  *
@@ -298,14 +299,15 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * either. The function handles this by splitting the free page or migrating
  * the in-use page then splitting the free page.
  */
-static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
-			bool isolate_before)
+static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
+			gfp_t gfp_flags, bool isolate_before)
 {
 	unsigned char saved_mt;
 	unsigned long start_pfn;
 	unsigned long isolate_pageblock;
 	unsigned long pfn;
 	struct zone *zone;
+	int ret;

 	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));

@@ -325,7 +327,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				      zone->zone_start_pfn);

 	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
+	ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+			isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
+
+	if (ret)
+		return ret;

 	/*
 	 * Bail out early when the to-be-isolated pageblock does not form
@@ -374,7 +380,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 			struct page *head = compound_head(page);
 			unsigned long head_pfn = page_to_pfn(head);

-			if (head_pfn + nr_pages < boundary_pfn) {
+			if (head_pfn + nr_pages <= boundary_pfn) {
 				pfn = head_pfn + nr_pages;
 				continue;
 			}
@@ -386,7 +392,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
 				int order;
 				unsigned long outer_pfn;
-				int ret;
+				int page_mt = get_pageblock_migratetype(page);
+				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -399,9 +406,31 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				};
 				INIT_LIST_HEAD(&cc.migratepages);

+				/*
+				 * XXX: mark the page as MIGRATE_ISOLATE so that
+				 * no one else can grab the freed page after migration.
+				 * Ideally, the page should be freed as two separate
+				 * pages to be added into separate migratetype free
+				 * lists.
+				 */
+				if (isolate_page) {
+					ret = set_migratetype_isolate(page, page_mt,
+						flags, head_pfn, boundary_pfn - 1);
+					if (ret)
+						goto failed;
+				}
+
 				ret = __alloc_contig_migrate_range(&cc, head_pfn,
 							head_pfn + nr_pages);

+				/*
+				 * restore the page's migratetype so that it can
+				 * be split into separate migratetype free lists
+				 * later.
+				 */
+				if (isolate_page)
+					unset_migratetype_isolate(page, page_mt);
+
 				if (ret)
 					goto failed;
 				/*
@@ -417,10 +446,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				order = 0;
 				outer_pfn = pfn;
 				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					if (++order >= MAX_ORDER) {
-						outer_pfn = pfn;
-						break;
-					}
+					/* stop if we cannot find the free page */
+					if (++order >= MAX_ORDER)
+						goto failed;
 					outer_pfn &= ~0UL << order;
 				}
 				pfn = outer_pfn;
@@ -435,7 +463,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 	return 0;
 failed:
 	/* restore the original migratetype */
-	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
+	unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
 	return -EBUSY;
 }

@@ -496,12 +524,12 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	int ret;

 	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(isolate_start, gfp_flags, false);
+	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
 	if (ret)
 		return ret;

 	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, gfp_flags, true);
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
 	if (ret) {
 		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
 		return ret;
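
To make the new order computation in the split_free_page() hunk above concrete, here is a minimal
userspace sketch (my own illustration, not kernel code; the pfn and split offset are hypothetical).
It walks a hypothetical order-9 free page split at offset 384 and shows that the order of each
freed chunk is limited both by the buddy alignment of the current pfn (__ffs) and by the number of
pages remaining before the split point (__fls):

#include <stdio.h>

/* userspace stand-ins for the kernel's __ffs()/__fls(), valid for non-zero values */
static int lowest_set_bit(unsigned long x)  { return __builtin_ctzl(x); }
static int highest_set_bit(unsigned long x) { return 63 - __builtin_clzl(x); }

int main(void)
{
	const int order = 9;				/* order-9 free page */
	unsigned long free_page_pfn = 0x8039fe00;	/* hypothetical, buddy-aligned pfn */
	unsigned long split_pfn_offset = 384;		/* hypothetical split point */
	unsigned long pfn;

	for (pfn = free_page_pfn; pfn < free_page_pfn + (1UL << order);) {
		int align_order = pfn ? lowest_set_bit(pfn) : order;
		int size_order  = highest_set_bit(split_pfn_offset);
		int free_page_order = align_order < size_order ? align_order : size_order;

		printf("free order-%d chunk at offset %lu\n",
		       free_page_order, pfn - free_page_pfn);

		pfn += 1UL << free_page_order;
		split_pfn_offset -= 1UL << free_page_order;
		/* first part done, switch to the second part */
		if (split_pfn_offset == 0)
			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
	}
	return 0;
}

This prints an order-8 chunk at offset 0, an order-7 chunk at offset 256, and an order-7 chunk at
offset 384, all properly aligned. With the old ffs(split_pfn_offset) - 1 formula, the same walk
would attempt to free an order-8 chunk at offset 128, where the pfn is only aligned to order 7.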


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-22 16:54                                   ` Zi Yan
@ 2022-05-22 19:33                                     ` Zi Yan
  2022-05-24 16:59                                     ` Qian Cai
  1 sibling, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-22 19:33 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1810 bytes --]

On 22 May 2022, at 12:54, Zi Yan wrote:

> On 20 May 2022, at 19:41, Qian Cai wrote:
>
>> On Fri, May 20, 2022 at 05:56:52PM -0400, Zi Yan wrote:
>>> Do you have the page information like refcount, map count, mapping, index, and
>>> page flags? That would be more helpful. Thanks.
>>
>> page:fffffc200c7f8000 refcount:393 mapcount:1 mapping:0000000000000000 index:0xffffbb800 pfn:0x8039fe00
>> head:fffffc200c7f8000 order:9 compound_mapcount:0 compound_pincount:0
>> memcg:ffff40026005a000
>> anon flags: 0xbfffc000009001c(uptodate|dirty|lru|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
>> raw: 0bfffc000009001c fffffc2007b74048 fffffc2009c087c8 ffff08038dab9189
>> raw: 0000000ffffbb800 0000000000000000 0000018900000000 ffff40026005a000

OK. I replicated two scenarios that can produce the above page dump:
1. a PTE-mapped THP with 393 subpages mapped without any extra pin,
2. a PTE-mapped THP with 392 subpages mapped with an extra pin on the first subpage.
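(A rough consistency check, assuming, as the two scenarios imply, that the head page's refcount
here corresponds to the number of mapped subpages plus extra pins: 393 + 0 = 393 for scenario 1
and 392 + 1 = 393 for scenario 2, both matching refcount:393 in the dump above.)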

For scenario 1, there is no infinite looping on next-20220519 and next-20220520.

For scenario 2, an infinite loop happens on next-20220519, next-20220520, and next-20220520 with
my fixup patch from another email applied, when the memory block in which the page resides is
being offlined. However, after reverting all my patches, the infinite loop remains.

So, based on the experiments I have done, it looks to me that an infinite loop during memory
offlining is not a regression. David Hildenbrand can correct me if I am wrong. A better issue
description, other than "infinite loop during memory offlining", and a better reproducer are
needed for me to identify potential bugs in my code and fix them.

Of course, my fixup patch should be applied anyway.

Thanks for your testing.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.
  2022-05-22 16:54                                   ` Zi Yan
  2022-05-22 19:33                                     ` Zi Yan
@ 2022-05-24 16:59                                     ` Qian Cai
  1 sibling, 0 replies; 44+ messages in thread
From: Qian Cai @ 2022-05-24 16:59 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton

On Sun, May 22, 2022 at 12:54:04PM -0400, Zi Yan wrote:
> Can you try the patch below on top of linux-next to see if it fixes the infinite loop issue?
> Thanks.
> 
> 1. The split_free_page() change is not strictly relevant here, but it makes the code more robust.
> 2. Using set_migratetype_isolate() in isolate_single_pageblock() properly marks the pageblock
> MIGRATE_ISOLATE.
> 3. Setting the to-be-migrated page's pageblock to MIGRATE_ISOLATE avoids a possible race in which
> another thread might take the free page after migration.
> 4. An off-by-one fix, plus no retry if the free page is not found after migration, as I added before.

Cool. I'll be running it this week and report back next week.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-04-29 13:54   ` Zi Yan
@ 2022-05-24 19:00     ` Zi Yan
  2022-05-25 17:41     ` Doug Berger
  1 sibling, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-24 19:00 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, linux-mm, Qian Cai
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	kernel test robot

[-- Attachment #1: Type: text/plain, Size: 42422 bytes --]

>
> From fce466e89e50bcb0ebb56d7809db1b8bbea47628 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Tue, 26 Apr 2022 23:00:33 -0400
> Subject: [PATCH] mm: make alloc_contig_range work at pageblock granularity
>
> alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
> merging pageblocks with different migratetypes.  It might unnecessarily
> convert extra pageblocks at the beginning and at the end of the range.
> Change alloc_contig_range() to work at pageblock granularity.
>
> Special handling is needed for free pages and in-use pages across the
> boundaries of the range specified by alloc_contig_range(), because these
> partially isolated pages cause free page accounting issues.  The free
> pages will be split and freed into separate migratetype lists; the in-use
> pages will be migrated, then the freed pages will be handled in the
> aforementioned way.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/page-isolation.h |   4 +-
>  mm/internal.h                  |   6 +
>  mm/memory_hotplug.c            |   3 +-
>  mm/page_alloc.c                |  54 +++++++--
>  mm/page_isolation.c            | 193 ++++++++++++++++++++++++++++++++-
>  5 files changed, 242 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index e14eddf6741a..5456b7be38ae 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
>   */
>  int
>  start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			 unsigned migratetype, int flags);
> +			 int migratetype, int flags, gfp_t gfp_flags);
>
>  /*
>   * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
> @@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>   */
>  void
>  undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			unsigned migratetype);
> +			int migratetype);
>
>  /*
>   * Test all pages in [start_pfn, end_pfn) are isolated or not.
> diff --git a/mm/internal.h b/mm/internal.h
> index 919fa07e1031..0667abd57634 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>  			  phys_addr_t min_addr,
>  			  int nid, bool exact_nid);
>
> +void split_free_page(struct page *free_page,
> +				int order, unsigned long split_pfn_offset);
> +
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
>  /*
> @@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
>  int
>  isolate_migratepages_range(struct compact_control *cc,
>  			   unsigned long low_pfn, unsigned long end_pfn);
> +
> +int __alloc_contig_migrate_range(struct compact_control *cc,
> +					unsigned long start, unsigned long end);
>  #endif
>  int find_suitable_fallback(struct free_area *area, unsigned int order,
>  			int migratetype, bool only_stealable, bool *can_steal);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4c6065e5d274..9f8ae4cb77ee 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1845,7 +1845,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
> -				       MEMORY_OFFLINE | REPORT_FAILURE);
> +				       MEMORY_OFFLINE | REPORT_FAILURE,
> +				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
>  	if (ret) {
>  		reason = "failure to isolate range";
>  		goto failed_removal_pcplists_disabled;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 93dbe05a6029..6a0d1746c095 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1094,6 +1094,43 @@ static inline void __free_one_page(struct page *page,
>  		page_reporting_notify_free(order);
>  }
>
> +/**
> + * split_free_page() -- split a free page at split_pfn_offset
> + * @free_page:		the original free page
> + * @order:		the order of the page
> + * @split_pfn_offset:	split offset within the page
> + *
> + * It is used when the free page crosses two pageblocks with different migratetypes
> + * at split_pfn_offset within the page. The split free page will be put into
> + * separate migratetype lists afterwards. Otherwise, the function achieves
> + * nothing.
> + */
> +void split_free_page(struct page *free_page,
> +				int order, unsigned long split_pfn_offset)
> +{
> +	struct zone *zone = page_zone(free_page);
> +	unsigned long free_page_pfn = page_to_pfn(free_page);
> +	unsigned long pfn;
> +	unsigned long flags;
> +	int free_page_order;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	del_page_from_free_list(free_page, zone, order);
> +	for (pfn = free_page_pfn;
> +	     pfn < free_page_pfn + (1UL << order);) {
> +		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> +
> +		free_page_order = ffs(split_pfn_offset) - 1;
> +		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
> +				mt, FPI_NONE);
> +		pfn += 1UL << free_page_order;
> +		split_pfn_offset -= (1UL << free_page_order);
> +		/* we have done the first part, now switch to second part */
> +		if (split_pfn_offset == 0)
> +			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
>  /*
>   * A bad page could be due to a number of fields. Instead of multiple branches,
>   * try and check multiple fields with one check. The caller must do a detailed
> @@ -8919,7 +8956,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
>  #endif
>
>  /* [start, end) must belong to a single zone. */
> -static int __alloc_contig_migrate_range(struct compact_control *cc,
> +int __alloc_contig_migrate_range(struct compact_control *cc,
>  					unsigned long start, unsigned long end)
>  {
>  	/* This function is based on compact_zone() from compaction.c. */
> @@ -9002,7 +9039,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		       unsigned migratetype, gfp_t gfp_mask)
>  {
>  	unsigned long outer_start, outer_end;
> -	unsigned int order;
> +	int order;
>  	int ret = 0;
>
>  	struct compact_control cc = {
> @@ -9021,14 +9058,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * What we do here is we mark all pageblocks in range as
>  	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
>  	 * have different sizes, and due to the way page allocator
> -	 * work, we align the range to biggest of the two pages so
> -	 * that page allocator won't try to merge buddies from
> -	 * different pageblocks and change MIGRATE_ISOLATE to some
> -	 * other migration type.
> +	 * work, start_isolate_page_range() has special handlings for this.
>  	 *
>  	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
>  	 * migrate the pages from an unaligned range (ie. pages that
> -	 * we are interested in).  This will put all the pages in
> +	 * we are interested in). This will put all the pages in
>  	 * range back to page allocator as MIGRATE_ISOLATE.
>  	 *
>  	 * When this is done, we take the pages in range from page
> @@ -9042,9 +9076,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 */
>
>  	ret = start_isolate_page_range(pfn_max_align_down(start),
> -				       pfn_max_align_up(end), migratetype, 0);
> +				pfn_max_align_up(end), migratetype, 0, gfp_mask);
>  	if (ret)
> -		return ret;
> +		goto done;
>
>  	drain_all_pages(cc.zone);
>
> @@ -9064,7 +9098,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	ret = 0;
>
>  	/*
> -	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
> +	 * Pages from [start, end) are within a pageblock_nr_pages
>  	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
>  	 * more, all pages in [start, end) are free in page allocator.
>  	 * What we are going to do is to allocate all pages from
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c2f7a8bb634d..8a0f16d2e4c3 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  	return -EBUSY;
>  }
>
> -static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> +static void unset_migratetype_isolate(struct page *page, int migratetype)
>  {
>  	struct zone *zone;
>  	unsigned long flags, nr_pages;
> @@ -279,6 +279,166 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>  	return NULL;
>  }
>
> +/**
> + * isolate_single_pageblock() -- tries to isolate a pageblock that might be
> + * within a free or in-use page.
> + * @boundary_pfn:		pageblock-aligned pfn that a page might cross
> + * @gfp_flags:			GFP flags used for migrating pages
> + * @isolate_before:	isolate the pageblock before the boundary_pfn
> + *
> + * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
> + * pageblock. When not all pageblocks within a page are isolated at the same
> + * time, free page accounting can go wrong. For example, in the case of
> + * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pageblocks.
> + * [         MAX_ORDER-1         ]
> + * [  pageblock0  |  pageblock1  ]
> + * When either pageblock is isolated, if it is a free page, the page is not
> + * split into separate migratetype lists, as it is supposed to be; if it is an
> + * in-use page and freed later, __free_one_page() does not split the free page
> + * either. The function handles this by splitting the free page or migrating
> + * the in-use page then splitting the free page.
> + */
> +static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
> +			bool isolate_before)
> +{
> +	unsigned char saved_mt;
> +	unsigned long start_pfn;
> +	unsigned long isolate_pageblock;
> +	unsigned long pfn;
> +	struct zone *zone;
> +
> +	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
> +
> +	if (isolate_before)
> +		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
> +	else
> +		isolate_pageblock = boundary_pfn;
> +
> +	/*
> +	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
> +	 * only isolating a subset of pageblocks from a bigger than pageblock
> +	 * free or in-use page. Also make sure all to-be-isolated pageblocks
> +	 * are within the same zone.
> +	 */
> +	zone  = page_zone(pfn_to_page(isolate_pageblock));
> +	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
> +				      zone->zone_start_pfn);
> +
> +	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
> +
> +	/*
> +	 * Bail out early when the to-be-isolated pageblock does not form
> +	 * a free or in-use page across boundary_pfn:
> +	 *
> +	 * 1. isolate before boundary_pfn: the page after is not online
> +	 * 2. isolate after boundary_pfn: the page before is not online
> +	 *
> +	 * This also ensures correctness. Without it, when isolate after
> +	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
> +	 * __first_valid_page() will return unexpected NULL in the for loop
> +	 * below.
> +	 */
> +	if (isolate_before) {
> +		if (!pfn_to_online_page(boundary_pfn))
> +			return 0;
> +	} else {
> +		if (!pfn_to_online_page(boundary_pfn - 1))
> +			return 0;
> +	}
> +
> +	for (pfn = start_pfn; pfn < boundary_pfn;) {
> +		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
> +
> +		VM_BUG_ON(!page);
> +		pfn = page_to_pfn(page);
> +		/*
> +		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
> +		 * free pages in [start_pfn, boundary_pfn), its head page will
> +		 * always be in the range.
> +		 */
> +		if (PageBuddy(page)) {
> +			int order = buddy_order(page);
> +
> +			if (pfn + (1UL << order) > boundary_pfn)
> +				split_free_page(page, order, boundary_pfn - pfn);
> +			pfn += (1UL << order);
> +			continue;
> +		}
> +		/*
> +		 * migrate compound pages then let the free page handling code
> +		 * above do the rest. If migration is not possible, just fail.
> +		 */
> +		if (PageCompound(page)) {
> +			unsigned long nr_pages = compound_nr(page);
> +			struct page *head = compound_head(page);
> +			unsigned long head_pfn = page_to_pfn(head);
> +
> +			if (head_pfn + nr_pages < boundary_pfn) {
> +				pfn = head_pfn + nr_pages;
> +				continue;
> +			}
> +#if defined CONFIG_COMPACTION || defined CONFIG_CMA
> +			/*
> +			 * hugetlb, lru compound (THP), and movable compound pages
> +			 * can be migrated. Otherwise, fail the isolation.
> +			 */
> +			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
> +				int order;
> +				unsigned long outer_pfn;
> +				int ret;
> +				struct compact_control cc = {
> +					.nr_migratepages = 0,
> +					.order = -1,
> +					.zone = page_zone(pfn_to_page(head_pfn)),
> +					.mode = MIGRATE_SYNC,
> +					.ignore_skip_hint = true,
> +					.no_set_skip_hint = true,
> +					.gfp_mask = gfp_flags,
> +					.alloc_contig = true,
> +				};
> +				INIT_LIST_HEAD(&cc.migratepages);
> +
> +				ret = __alloc_contig_migrate_range(&cc, head_pfn,
> +							head_pfn + nr_pages);
> +
> +				if (ret)
> +					goto failed;
> +				/*
> +				 * reset pfn to the head of the free page, so
> +				 * that the free page handling code above can split
> +				 * the free page to the right migratetype list.
> +				 *
> +				 * head_pfn is not used here as a hugetlb page order
> +				 * can be bigger than MAX_ORDER-1, but after it is
> +				 * freed, the free page order is not. Use pfn within
> +				 * the range to find the head of the free page.
> +				 */
> +				order = 0;
> +				outer_pfn = pfn;
> +				while (!PageBuddy(pfn_to_page(outer_pfn))) {
> +					if (++order >= MAX_ORDER) {
> +						outer_pfn = pfn;
> +						break;
> +					}
> +					outer_pfn &= ~0UL << order;
> +				}
> +				pfn = outer_pfn;
> +				continue;
> +			} else
> +#endif
> +				goto failed;
> +		}
> +
> +		pfn++;
> +	}
> +	return 0;
> +failed:
> +	/* restore the original migratetype */
> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
> +	return -EBUSY;
> +}
> +
>  /**
>   * start_isolate_page_range() - make page-allocation-type of range of pages to
>   * be MIGRATE_ISOLATE.
> @@ -293,6 +453,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   *					 and PageOffline() pages.
>   *			REPORT_FAILURE - report details about the failure to
>   *			isolate the range
> + * @gfp_flags:		GFP flags used for migrating pages that sit across the
> + *			range boundaries.
>   *
>   * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
>   * the range will never be allocated. Any free pages and pages freed in the
> @@ -301,6 +463,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * pages in the range finally, the caller have to free all pages in the range.
>   * test_page_isolated() can be used for test it.
>   *
> + * The function first tries to isolate the pageblocks at the beginning and end
> + * of the range, since there might be pages across the range boundaries.
> + * Afterwards, it isolates the rest of the range.
> + *
>   * There is no high level synchronization mechanism that prevents two threads
>   * from trying to isolate overlapping ranges. If this happens, one thread
>   * will notice pageblocks in the overlapping range already set to isolate.
> @@ -321,21 +487,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
>  int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			     unsigned migratetype, int flags)
> +			     int migratetype, int flags, gfp_t gfp_flags)
>  {
>  	unsigned long pfn;
>  	struct page *page;
> +	int ret;
>
>  	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
>  	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
>
> -	for (pfn = start_pfn;
> -	     pfn < end_pfn;
> +	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
> +	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
> +	if (ret)
> +		return ret;
> +
> +	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
> +	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
> +	if (ret) {
> +		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
> +		return ret;
> +	}
> +
> +	/* skip isolated pageblocks at the beginning and end */
> +	for (pfn = start_pfn + pageblock_nr_pages;
> +	     pfn < end_pfn - pageblock_nr_pages;
>  	     pfn += pageblock_nr_pages) {
>  		page = __first_valid_page(pfn, pageblock_nr_pages);
>  		if (page && set_migratetype_isolate(page, migratetype, flags,
>  					start_pfn, end_pfn)) {
>  			undo_isolate_page_range(start_pfn, pfn, migratetype);
> +			unset_migratetype_isolate(
> +				pfn_to_page(end_pfn - pageblock_nr_pages),
> +				migratetype);
>  			return -EBUSY;
>  		}
>  	}
> @@ -346,7 +529,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>   * Make isolated pages available again.
>   */
>  void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			    unsigned migratetype)
> +			    int migratetype)
>  {
>  	unsigned long pfn;
>  	struct page *page;
> -- 
> 2.35.1
>
> --
> Best Regards,
> Yan, Zi

To address the infinite loop issue reported by Qian Cai, the following fixup should be applied to the commit above; another fixup patch should be applied to Patch 4 in this series (I will reply to the Patch 4 email):


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c7252ed14a0..76551933bb1d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1114,13 +1114,16 @@ void split_free_page(struct page *free_page,
 	unsigned long flags;
 	int free_page_order;

+	if (split_pfn_offset == 0)
+		return;
+
 	spin_lock_irqsave(&zone->lock, flags);
 	del_page_from_free_list(free_page, zone, order);
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

-		free_page_order = ffs(split_pfn_offset) - 1;
+		free_page_order = min(pfn ? __ffs(pfn) : order, __fls(split_pfn_offset));
 		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
 				mt, FPI_NONE);
 		pfn += 1UL << free_page_order;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 8a0f16d2e4c3..7e45736d6451 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -283,6 +283,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * isolate_single_pageblock() -- tries to isolate a pageblock that might be
  * within a free or in-use page.
  * @boundary_pfn:		pageblock-aligned pfn that a page might cross
+ * @flags:			isolation flags
  * @gfp_flags:			GFP flags used for migrating pages
  * @isolate_before:	isolate the pageblock before the boundary_pfn
  *
@@ -298,14 +299,15 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * either. The function handles this by splitting the free page or migrating
  * the in-use page then splitting the free page.
  */
-static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
-			bool isolate_before)
+static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
+			gfp_t gfp_flags, bool isolate_before)
 {
 	unsigned char saved_mt;
 	unsigned long start_pfn;
 	unsigned long isolate_pageblock;
 	unsigned long pfn;
 	struct zone *zone;
+	int ret;

 	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));

@@ -325,7 +327,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				      zone->zone_start_pfn);

 	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
+	ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+			isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
+
+	if (ret)
+		return ret;

 	/*
 	 * Bail out early when the to-be-isolated pageblock does not form
@@ -374,7 +380,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 			struct page *head = compound_head(page);
 			unsigned long head_pfn = page_to_pfn(head);

-			if (head_pfn + nr_pages < boundary_pfn) {
+			if (head_pfn + nr_pages <= boundary_pfn) {
 				pfn = head_pfn + nr_pages;
 				continue;
 			}
@@ -386,7 +392,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
 				int order;
 				unsigned long outer_pfn;
-				int ret;
+				int page_mt = get_pageblock_migratetype(page);
+				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -399,9 +406,31 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				};
 				INIT_LIST_HEAD(&cc.migratepages);

+				/*
+				 * XXX: mark the page as MIGRATE_ISOLATE so that
+				 * no one else can grab the freed page after migration.
+				 * Ideally, the page should be freed as two separate
+				 * pages to be added into separate migratetype free
+				 * lists.
+				 */
+				if (isolate_page) {
+					ret = set_migratetype_isolate(page, page_mt,
+						flags, head_pfn, head_pfn + nr_pages);
+					if (ret)
+						goto failed;
+				}
+
 				ret = __alloc_contig_migrate_range(&cc, head_pfn,
 							head_pfn + nr_pages);

+				/*
+				 * restore the page's migratetype so that it can
+				 * be split into separate migratetype free lists
+				 * later.
+				 */
+				if (isolate_page)
+					unset_migratetype_isolate(page, page_mt);
+
 				if (ret)
 					goto failed;
 				/*
@@ -417,10 +446,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				order = 0;
 				outer_pfn = pfn;
 				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					if (++order >= MAX_ORDER) {
-						outer_pfn = pfn;
-						break;
-					}
+					/* stop if we cannot find the free page */
+					if (++order >= MAX_ORDER)
+						goto failed;
 					outer_pfn &= ~0UL << order;
 				}
 				pfn = outer_pfn;
@@ -435,7 +463,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 	return 0;
 failed:
 	/* restore the original migratetype */
-	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
+	unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
 	return -EBUSY;
 }

@@ -497,12 +525,12 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));

 	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
+	ret = isolate_single_pageblock(start_pfn, flags, gfp_flags, false);
 	if (ret)
 		return ret;

 	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
-	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
+	ret = isolate_single_pageblock(end_pfn, flags, gfp_flags, true);
 	if (ret) {
 		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
 		return ret;
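
As a note on the head_pfn + nr_pages <= boundary_pfn change above: a compound page that ends
exactly at boundary_pfn (head_pfn + nr_pages == boundary_pfn) lies entirely below the boundary
and only needs to be skipped; with the old strict < comparison, that case fell through to the
migration path even though the page does not cross the boundary.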



The complete commit with the fixup patch applied is:

From 71a4c830ce96d23aacb11ec715cc27d482acdd93 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Thu, 12 May 2022 20:22:58 -0700
Subject: [PATCH] mm: make alloc_contig_range work at pageblock granularity

alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
merging pageblocks with different migratetypes.  It might unnecessarily
convert extra pageblocks at the beginning and at the end of the range.
Change alloc_contig_range() to work at pageblock granularity.

Special handling is needed for free pages and in-use pages across the
boundaries of the range specified by alloc_contig_range(), because these
partially isolated pages cause free page accounting issues.  The free
pages will be split and freed into separate migratetype lists; the in-use
pages will be migrated, then the freed pages will be handled in the
aforementioned way.

[ziy@nvidia.com: fix deadlock/crash]
  Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/page-isolation.h |   4 +-
 mm/internal.h                  |   6 +
 mm/memory_hotplug.c            |   3 +-
 mm/page_alloc.c                |  57 +++++++--
 mm/page_isolation.c            | 221 ++++++++++++++++++++++++++++++++-
 5 files changed, 273 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index e14eddf6741a..5456b7be38ae 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
  */
 int
 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			 unsigned migratetype, int flags);
+			 int migratetype, int flags, gfp_t gfp_flags);

 /*
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
@@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
  */
 void
 undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			unsigned migratetype);
+			int migratetype);

 /*
  * Test all pages in [start_pfn, end_pfn) are isolated or not.
diff --git a/mm/internal.h b/mm/internal.h
index ddd09245a6db..a770029beb08 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
 			  phys_addr_t min_addr,
 			  int nid, bool exact_nid);

+void split_free_page(struct page *free_page,
+				int order, unsigned long split_pfn_offset);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA

 /*
@@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
 int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
+
+int __alloc_contig_migrate_range(struct compact_control *cc,
+					unsigned long start, unsigned long end);
 #endif
 int find_suitable_fallback(struct free_area *area, unsigned int order,
 			int migratetype, bool only_stealable, bool *can_steal);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e99fd60548f5..945191708ef6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1837,7 +1837,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
-				       MEMORY_OFFLINE | REPORT_FAILURE);
+				       MEMORY_OFFLINE | REPORT_FAILURE,
+				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
 	if (ret) {
 		reason = "failure to isolate range";
 		goto failed_removal_pcplists_disabled;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0756f046b644..76551933bb1d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1094,6 +1094,46 @@ static inline void __free_one_page(struct page *page,
 		page_reporting_notify_free(order);
 }

+/**
+ * split_free_page() -- split a free page at split_pfn_offset
+ * @free_page:		the original free page
+ * @order:		the order of the page
+ * @split_pfn_offset:	split offset within the page
+ *
+ * It is used when the free page crosses two pageblocks with different migratetypes
+ * at split_pfn_offset within the page. The split free page will be put into
+ * separate migratetype lists afterwards. Otherwise, the function achieves
+ * nothing.
+ */
+void split_free_page(struct page *free_page,
+				int order, unsigned long split_pfn_offset)
+{
+	struct zone *zone = page_zone(free_page);
+	unsigned long free_page_pfn = page_to_pfn(free_page);
+	unsigned long pfn;
+	unsigned long flags;
+	int free_page_order;
+
+	if (split_pfn_offset == 0)
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	del_page_from_free_list(free_page, zone, order);
+	for (pfn = free_page_pfn;
+	     pfn < free_page_pfn + (1UL << order);) {
+		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
+
+		free_page_order = min(pfn ? __ffs(pfn) : order, __fls(split_pfn_offset));
+		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
+				mt, FPI_NONE);
+		pfn += 1UL << free_page_order;
+		split_pfn_offset -= (1UL << free_page_order);
+		/* we have done the first part, now switch to second part */
+		if (split_pfn_offset == 0)
+			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -8951,7 +8991,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
 #endif

 /* [start, end) must belong to a single zone. */
-static int __alloc_contig_migrate_range(struct compact_control *cc,
+int __alloc_contig_migrate_range(struct compact_control *cc,
 					unsigned long start, unsigned long end)
 {
 	/* This function is based on compact_zone() from compaction.c. */
@@ -9034,7 +9074,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	unsigned int order;
+	int order;
 	int ret = 0;

 	struct compact_control cc = {
@@ -9053,14 +9093,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * What we do here is we mark all pageblocks in range as
 	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
 	 * have different sizes, and due to the way page allocator
-	 * work, we align the range to biggest of the two pages so
-	 * that page allocator won't try to merge buddies from
-	 * different pageblocks and change MIGRATE_ISOLATE to some
-	 * other migration type.
+	 * work, start_isolate_page_range() has special handlings for this.
 	 *
 	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
 	 * migrate the pages from an unaligned range (ie. pages that
-	 * we are interested in).  This will put all the pages in
+	 * we are interested in). This will put all the pages in
 	 * range back to page allocator as MIGRATE_ISOLATE.
 	 *
 	 * When this is done, we take the pages in range from page
@@ -9074,9 +9111,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */

 	ret = start_isolate_page_range(pfn_max_align_down(start),
-				       pfn_max_align_up(end), migratetype, 0);
+				pfn_max_align_up(end), migratetype, 0, gfp_mask);
 	if (ret)
-		return ret;
+		goto done;

 	drain_all_pages(cc.zone);

@@ -9096,7 +9133,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	ret = 0;

 	/*
-	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
+	 * Pages from [start, end) are within a pageblock_nr_pages
 	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
 	 * more, all pages in [start, end) are free in page allocator.
 	 * What we are going to do is to allocate all pages from
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c2f7a8bb634d..6b47acaf51f3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	return -EBUSY;
 }

-static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
+static void unset_migratetype_isolate(struct page *page, int migratetype)
 {
 	struct zone *zone;
 	unsigned long flags, nr_pages;
@@ -279,6 +279,194 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 	return NULL;
 }

+/**
+ * isolate_single_pageblock() -- tries to isolate a pageblock that might be
+ * within a free or in-use page.
+ * @boundary_pfn:		pageblock-aligned pfn that a page might cross
+ * @flags:			isolation flags
+ * @gfp_flags:			GFP flags used for migrating pages
+ * @isolate_before:	isolate the pageblock before the boundary_pfn
+ *
+ * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
+ * pageblock. When not all pageblocks within a page are isolated at the same
+ * time, free page accounting can go wrong. For example, in the case of
+ * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pageblocks.
+ * [         MAX_ORDER-1         ]
+ * [  pageblock0  |  pageblock1  ]
+ * When either pageblock is isolated, if it is a free page, the page is not
+ * split into separate migratetype lists, as it is supposed to be; if it is an
+ * in-use page and freed later, __free_one_page() does not split the free page
+ * either. The function handles this by splitting the free page or migrating
+ * the in-use page then splitting the free page.
+ */
+static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
+			gfp_t gfp_flags, bool isolate_before)
+{
+	unsigned char saved_mt;
+	unsigned long start_pfn;
+	unsigned long isolate_pageblock;
+	unsigned long pfn;
+	struct zone *zone;
+	int ret;
+
+	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
+
+	if (isolate_before)
+		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
+	else
+		isolate_pageblock = boundary_pfn;
+
+	/*
+	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
+	 * only isolating a subset of pageblocks from a bigger than pageblock
+	 * free or in-use page. Also make sure all to-be-isolated pageblocks
+	 * are within the same zone.
+	 */
+	zone  = page_zone(pfn_to_page(isolate_pageblock));
+	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
+				      zone->zone_start_pfn);
+
+	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
+	ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+			isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * Bail out early when the to-be-isolated pageblock does not form
+	 * a free or in-use page across boundary_pfn:
+	 *
+	 * 1. isolate before boundary_pfn: the page after is not online
+	 * 2. isolate after boundary_pfn: the page before is not online
+	 *
+	 * This also ensures correctness. Without it, when isolate after
+	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
+	 * __first_valid_page() will return unexpected NULL in the for loop
+	 * below.
+	 */
+	if (isolate_before) {
+		if (!pfn_to_online_page(boundary_pfn))
+			return 0;
+	} else {
+		if (!pfn_to_online_page(boundary_pfn - 1))
+			return 0;
+	}
+
+	for (pfn = start_pfn; pfn < boundary_pfn;) {
+		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
+
+		VM_BUG_ON(!page);
+		pfn = page_to_pfn(page);
+		/*
+		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
+		 * free pages in [start_pfn, boundary_pfn), its head page will
+		 * always be in the range.
+		 */
+		if (PageBuddy(page)) {
+			int order = buddy_order(page);
+
+			if (pfn + (1UL << order) > boundary_pfn)
+				split_free_page(page, order, boundary_pfn - pfn);
+			pfn += (1UL << order);
+			continue;
+		}
+		/*
+		 * migrate compound pages then let the free page handling code
+		 * above do the rest. If migration is not possible, just fail.
+		 */
+		if (PageCompound(page)) {
+			unsigned long nr_pages = compound_nr(page);
+			struct page *head = compound_head(page);
+			unsigned long head_pfn = page_to_pfn(head);
+
+			if (head_pfn + nr_pages <= boundary_pfn) {
+				pfn = head_pfn + nr_pages;
+				continue;
+			}
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+			/*
+			 * hugetlb, lru compound (THP), and movable compound pages
+			 * can be migrated. Otherwise, fail the isolation.
+			 */
+			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+				int order;
+				unsigned long outer_pfn;
+				int page_mt = get_pageblock_migratetype(page);
+				bool isolate_page = !is_migrate_isolate_page(page);
+				struct compact_control cc = {
+					.nr_migratepages = 0,
+					.order = -1,
+					.zone = page_zone(pfn_to_page(head_pfn)),
+					.mode = MIGRATE_SYNC,
+					.ignore_skip_hint = true,
+					.no_set_skip_hint = true,
+					.gfp_mask = gfp_flags,
+					.alloc_contig = true,
+				};
+				INIT_LIST_HEAD(&cc.migratepages);
+
+				/*
+				 * XXX: mark the page as MIGRATE_ISOLATE so that
+				 * no one else can grab the freed page after migration.
+				 * Ideally, the page should be freed as two separate
+				 * pages to be added into separate migratetype free
+				 * lists.
+				 */
+				if (isolate_page) {
+					ret = set_migratetype_isolate(page, page_mt,
+						flags, head_pfn, boundary_pfn - 1);
+					if (ret)
+						goto failed;
+				}
+
+				ret = __alloc_contig_migrate_range(&cc, head_pfn,
+							head_pfn + nr_pages);
+
+				/*
+				 * restore the page's migratetype so that it can
+				 * be split into separate migratetype free lists
+				 * later.
+				 */
+				if (isolate_page)
+					unset_migratetype_isolate(page, page_mt);
+
+				if (ret)
+					goto failed;
+				/*
+				 * reset pfn to the head of the free page, so
+				 * that the free page handling code above can split
+				 * the free page to the right migratetype list.
+				 *
+				 * head_pfn is not used here as a hugetlb page order
+				 * can be bigger than MAX_ORDER-1, but after it is
+				 * freed, the free page order is not. Use pfn within
+				 * the range to find the head of the free page.
+				 */
+				order = 0;
+				outer_pfn = pfn;
+				while (!PageBuddy(pfn_to_page(outer_pfn))) {
+					/* stop if we cannot find the free page */
+					if (++order >= MAX_ORDER)
+						goto failed;
+					outer_pfn &= ~0UL << order;
+				}
+				pfn = outer_pfn;
+				continue;
+			} else
+#endif
+				goto failed;
+		}
+
+		pfn++;
+	}
+	return 0;
+failed:
+	/* restore the original migratetype */
+	unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
+	return -EBUSY;
+}
+
 /**
  * start_isolate_page_range() - make page-allocation-type of range of pages to
  * be MIGRATE_ISOLATE.
@@ -293,6 +481,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  *					 and PageOffline() pages.
  *			REPORT_FAILURE - report details about the failure to
  *			isolate the range
+ * @gfp_flags:		GFP flags used for migrating pages that sit across the
+ *			range boundaries.
  *
  * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
  * the range will never be allocated. Any free pages and pages freed in the
@@ -301,6 +491,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * pages in the range finally, the caller have to free all pages in the range.
  * test_page_isolated() can be used for test it.
  *
+ * The function first tries to isolate the pageblocks at the beginning and end
+ * of the range, since there might be pages across the range boundaries.
+ * Afterwards, it isolates the rest of the range.
+ *
  * There is no high level synchronization mechanism that prevents two threads
  * from trying to isolate overlapping ranges. If this happens, one thread
  * will notice pageblocks in the overlapping range already set to isolate.
@@ -321,21 +515,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			     unsigned migratetype, int flags)
+			     int migratetype, int flags, gfp_t gfp_flags)
 {
 	unsigned long pfn;
 	struct page *page;
+	int ret;

 	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
 	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));

-	for (pfn = start_pfn;
-	     pfn < end_pfn;
+	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
+	ret = isolate_single_pageblock(start_pfn, flags, gfp_flags, false);
+	if (ret)
+		return ret;
+
+	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
+	ret = isolate_single_pageblock(end_pfn, flags, gfp_flags, true);
+	if (ret) {
+		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
+		return ret;
+	}
+
+	/* skip isolated pageblocks at the beginning and end */
+	for (pfn = start_pfn + pageblock_nr_pages;
+	     pfn < end_pfn - pageblock_nr_pages;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
 			undo_isolate_page_range(start_pfn, pfn, migratetype);
+			unset_migratetype_isolate(
+				pfn_to_page(end_pfn - pageblock_nr_pages),
+				migratetype);
 			return -EBUSY;
 		}
 	}
@@ -346,7 +557,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
  * Make isolated pages available again.
  */
 void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			    unsigned migratetype)
+			    int migratetype)
 {
 	unsigned long pfn;
 	struct page *page;
-- 
2.35.1



--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 4/6] mm: page_isolation: enable arbitrary range page isolation.
  2022-04-25 14:31 ` [PATCH v11 4/6] mm: page_isolation: enable arbitrary range page isolation Zi Yan
@ 2022-05-24 19:02   ` Zi Yan
  0 siblings, 0 replies; 44+ messages in thread
From: Zi Yan @ 2022-05-24 19:02 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, linux-mm, Qian Cai
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Zi Yan

[-- Attachment #1: Type: text/plain, Size: 12363 bytes --]

On 25 Apr 2022, at 10:31, Zi Yan wrote:

> From: Zi Yan <ziy@nvidia.com>
>
> Now start_isolate_page_range() is ready to handle arbitrary range
> isolation, so move the alignment check/adjustment into the function body.
> Do the same for its counterpart undo_isolate_page_range().
> alloc_contig_range(), its caller, can pass an arbitrary range instead of
> a MAX_ORDER_NR_PAGES aligned one.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/page_alloc.c     | 16 ++--------------
>  mm/page_isolation.c | 33 ++++++++++++++++-----------------
>  2 files changed, 18 insertions(+), 31 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 70ddd9a0bcf3..a002cf12eb6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8924,16 +8924,6 @@ void *__init alloc_large_system_hash(const char *tablename,
>  }
>
>  #ifdef CONFIG_CONTIG_ALLOC
> -static unsigned long pfn_max_align_down(unsigned long pfn)
> -{
> -	return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
> -}
> -
> -static unsigned long pfn_max_align_up(unsigned long pfn)
> -{
> -	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
> -}
> -
>  #if defined(CONFIG_DYNAMIC_DEBUG) || \
>  	(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
>  /* Usage: See admin-guide/dynamic-debug-howto.rst */
> @@ -9075,8 +9065,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * put back to page allocator so that buddy can use them.
>  	 */
>
> -	ret = start_isolate_page_range(pfn_max_align_down(start),
> -				pfn_max_align_up(end), migratetype, 0, gfp_mask);
> +	ret = start_isolate_page_range(start, end, migratetype, 0, gfp_mask);
>  	if (ret)
>  		goto done;
>
> @@ -9157,8 +9146,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		free_contig_range(end, outer_end - end);
>
>  done:
> -	undo_isolate_page_range(pfn_max_align_down(start),
> -				pfn_max_align_up(end), migratetype);
> +	undo_isolate_page_range(start, end, migratetype);
>  	return ret;
>  }
>  EXPORT_SYMBOL(alloc_contig_range);
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 94b3467e5ba2..75e454f5cf45 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -435,7 +435,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>   * be MIGRATE_ISOLATE.
>   * @start_pfn:		The lower PFN of the range to be isolated.
>   * @end_pfn:		The upper PFN of the range to be isolated.
> - *			start_pfn/end_pfn must be aligned to pageblock_order.
>   * @migratetype:	Migrate type to set in error recovery.
>   * @flags:		The following flags are allowed (they can be combined in
>   *			a bit mask)
> @@ -482,33 +481,33 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	unsigned long pfn;
>  	struct page *page;
> +	/* isolation is done at page block granularity */
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
> +	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>  	int ret;
>
> -	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
> -	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
> -
> -	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
> -	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
> +	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
> +	ret = isolate_single_pageblock(isolate_start, gfp_flags, false);
>  	if (ret)
>  		return ret;
>
> -	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
> -	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
> +	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> +	ret = isolate_single_pageblock(isolate_end, gfp_flags, true);
>  	if (ret) {
> -		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
> +		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>  		return ret;
>  	}
>
>  	/* skip isolated pageblocks at the beginning and end */
> -	for (pfn = start_pfn + pageblock_nr_pages;
> -	     pfn < end_pfn - pageblock_nr_pages;
> +	for (pfn = isolate_start + pageblock_nr_pages;
> +	     pfn < isolate_end - pageblock_nr_pages;
>  	     pfn += pageblock_nr_pages) {
>  		page = __first_valid_page(pfn, pageblock_nr_pages);
>  		if (page && set_migratetype_isolate(page, migratetype, flags,
>  					start_pfn, end_pfn)) {
> -			undo_isolate_page_range(start_pfn, pfn, migratetype);
> +			undo_isolate_page_range(isolate_start, pfn, migratetype);
>  			unset_migratetype_isolate(
> -				pfn_to_page(end_pfn - pageblock_nr_pages),
> +				pfn_to_page(isolate_end - pageblock_nr_pages),
>  				migratetype);
>  			return -EBUSY;
>  		}
> @@ -524,12 +523,12 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	unsigned long pfn;
>  	struct page *page;
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
> +	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>
> -	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
> -	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
>
> -	for (pfn = start_pfn;
> -	     pfn < end_pfn;
> +	for (pfn = isolate_start;
> +	     pfn < isolate_end;
>  	     pfn += pageblock_nr_pages) {
>  		page = __first_valid_page(pfn, pageblock_nr_pages);
>  		if (!page || !is_migrate_isolate_page(page))
> -- 
> 2.35.1

The fixup patch below should be applied to address the infinite loop issue reported by Qian Cai:

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 46cbc4621d84..b70a03d9c52b 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -520,12 +520,12 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	int ret;

 	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(isolate_start, gfp_flags, false);
+	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
 	if (ret)
 		return ret;

 	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, gfp_flags, true);
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
 	if (ret) {
 		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
 		return ret;




The complete commit with the fixup patch applied is:

From 211ef82d35d3a0cf108846a440145688f7cfa21f Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Thu, 12 May 2022 20:22:58 -0700
Subject: [PATCH] mm: page_isolation: enable arbitrary range page
 isolation.

Now start_isolate_page_range() is ready to handle arbitrary range
isolation, so move the alignment check/adjustment into the function body.
Do the same for its counterpart undo_isolate_page_range().
alloc_contig_range(), its caller, can pass an arbitrary range instead of a
MAX_ORDER_NR_PAGES aligned one.

Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/page_alloc.c     | 16 ++--------------
 mm/page_isolation.c | 33 ++++++++++++++++-----------------
 2 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76551933bb1d..9a21ea9af35c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8959,16 +8959,6 @@ void *__init alloc_large_system_hash(const char *tablename,
 }

 #ifdef CONFIG_CONTIG_ALLOC
-static unsigned long pfn_max_align_down(unsigned long pfn)
-{
-	return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
-}
-
-static unsigned long pfn_max_align_up(unsigned long pfn)
-{
-	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
-}
-
 #if defined(CONFIG_DYNAMIC_DEBUG) || \
 	(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
 /* Usage: See admin-guide/dynamic-debug-howto.rst */
@@ -9110,8 +9100,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * put back to page allocator so that buddy can use them.
 	 */

-	ret = start_isolate_page_range(pfn_max_align_down(start),
-				pfn_max_align_up(end), migratetype, 0, gfp_mask);
+	ret = start_isolate_page_range(start, end, migratetype, 0, gfp_mask);
 	if (ret)
 		goto done;

@@ -9192,8 +9181,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		free_contig_range(end, outer_end - end);

 done:
-	undo_isolate_page_range(pfn_max_align_down(start),
-				pfn_max_align_up(end), migratetype);
+	undo_isolate_page_range(start, end, migratetype);
 	return ret;
 }
 EXPORT_SYMBOL(alloc_contig_range);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 6b47acaf51f3..706915c9a380 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -472,7 +472,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
  * be MIGRATE_ISOLATE.
  * @start_pfn:		The lower PFN of the range to be isolated.
  * @end_pfn:		The upper PFN of the range to be isolated.
- *			start_pfn/end_pfn must be aligned to pageblock_order.
  * @migratetype:	Migrate type to set in error recovery.
  * @flags:		The following flags are allowed (they can be combined in
  *			a bit mask)
@@ -519,33 +518,33 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
+	/* isolation is done at page block granularity */
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
+	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
 	int ret;

-	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
-	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
-
-	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(start_pfn, flags, gfp_flags, false);
+	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
+	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
 	if (ret)
 		return ret;

-	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
-	ret = isolate_single_pageblock(end_pfn, flags, gfp_flags, true);
+	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
 	if (ret) {
-		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
+		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
 		return ret;
 	}

 	/* skip isolated pageblocks at the beginning and end */
-	for (pfn = start_pfn + pageblock_nr_pages;
-	     pfn < end_pfn - pageblock_nr_pages;
+	for (pfn = isolate_start + pageblock_nr_pages;
+	     pfn < isolate_end - pageblock_nr_pages;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
-			undo_isolate_page_range(start_pfn, pfn, migratetype);
+			undo_isolate_page_range(isolate_start, pfn, migratetype);
 			unset_migratetype_isolate(
-				pfn_to_page(end_pfn - pageblock_nr_pages),
+				pfn_to_page(isolate_end - pageblock_nr_pages),
 				migratetype);
 			return -EBUSY;
 		}
@@ -561,12 +560,12 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
+	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);

-	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
-	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));

-	for (pfn = start_pfn;
-	     pfn < end_pfn;
+	for (pfn = isolate_start;
+	     pfn < isolate_end;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (!page || !is_migrate_isolate_page(page))
-- 
2.35.1



--
Best Regards,
Yan, Zi


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-04-29 13:54   ` Zi Yan
  2022-05-24 19:00     ` Zi Yan
@ 2022-05-25 17:41     ` Doug Berger
  2022-05-25 17:53       ` Zi Yan
  1 sibling, 1 reply; 44+ messages in thread
From: Doug Berger @ 2022-05-25 17:41 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, linux-mm
  Cc: linux-kernel, virtualization, Vlastimil Babka, Mel Gorman,
	Eric Ren, Mike Rapoport, Oscar Salvador, Christophe Leroy,
	Andrew Morton, kernel test robot, Qian Cai

I am seeing some free memory accounting problems with linux-next that I 
have bisected to this commit (i.e. b2c9e2fbba32 ("mm: make 
alloc_contig_range work at pageblock granularity")).

On an arm64 SMP platform with 4GB total memory and the default 16MB 
CMA pool, I am seeing the following after boot with a sysrq Show 
Memory (e.g. 'echo m > /proc/sysrq-trigger'):

[   16.015906] sysrq: Show Memory
[   16.019039] Mem-Info:
[   16.021348] active_anon:14604 inactive_anon:919 isolated_anon:0
[   16.021348]  active_file:0 inactive_file:0 isolated_file:0
[   16.021348]  unevictable:0 dirty:0 writeback:0
[   16.021348]  slab_reclaimable:3662 slab_unreclaimable:3333
[   16.021348]  mapped:928 shmem:15146 pagetables:63 bounce:0
[   16.021348]  kernel_misc_reclaimable:0
[   16.021348]  free:976766 free_pcp:991 free_cma:7017
[   16.056937] Node 0 active_anon:58416kB inactive_anon:3676kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB mapped:3712kB dirty:0kB writeback:0kB shmem:60584kB 
writeback_tmp:0kB kernel_stack:1200kB pagetables:252kB all_unreclaimable? no
[   16.081526] DMA free:3041036kB boost:0kB min:6036kB low:9044kB 
high:12052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB 
present:3145728kB managed:3029992kB mlocked:0kB bounce:0kB 
free_pcp:636kB local_pcp:0kB free_cma:28068kB
[   16.108650] lowmem_reserve[]: 0 0 944 944
[   16.112746] Normal free:866028kB boost:0kB min:1936kB low:2900kB 
high:3864kB reserved_highatomic:0KB active_anon:58416kB 
inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB 
writepending:0kB present:1048576kB managed:967352kB mlocked:0kB 
bounce:0kB free_pcp:3328kB local_pcp:864kB free_cma:0kB
[   16.140393] lowmem_reserve[]: 0 0 0 0
[   16.144133] DMA: 7*4kB (UMC) 4*8kB (M) 3*16kB (M) 3*32kB (MC) 5*64kB 
(M) 4*128kB (MC) 5*256kB (UMC) 7*512kB (UM) 5*1024kB (UM) 9*2048kB (UMC) 
732*4096kB (MC) = 3027724kB
[   16.159609] Normal: 149*4kB (UM) 95*8kB (UME) 26*16kB (UME) 8*32kB 
(ME) 2*64kB (UE) 1*128kB (M) 2*256kB (ME) 2*512kB (ME) 2*1024kB (UM) 
0*2048kB 210*4096kB (M) = 866028kB
[   16.175165] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=1048576kB
[   16.183937] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=32768kB
[   16.192533] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=2048kB
[   16.201040] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=64kB
[   16.209374] 15146 total pagecache pages
[   16.213246] 0 pages in swap cache
[   16.216595] Swap cache stats: add 0, delete 0, find 0/0
[   16.221867] Free swap  = 0kB
[   16.224780] Total swap = 0kB
[   16.227693] 1048576 pages RAM
[   16.230694] 0 pages HighMem/MovableOnly
[   16.234564] 49240 pages reserved
[   16.237825] 4096 pages cma reserved

Some anomalies in the above are:
free_cma:7017 with only 4096 pages cma reserved
DMA free:3041036kB with only managed:3029992kB
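
To put numbers on the mismatch (assuming 4kB pages):

free_cma:  7017 pages * 4kB = 28068kB  (matches the DMA zone's free_cma:28068kB)
reserved:  4096 pages * 4kB = 16384kB  (the configured 16MB CMA pool)
DMA free:  3041036kB vs managed: 3029992kB, i.e. free exceeds managed by 11044kB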

I'm not sure what is going on here, but I am suspicious of 
split_free_page(), since del_page_from_free_list() doesn't affect the 
migratetype freepage accounting, but __free_one_page() can.
Also PageBuddy(page) is being checked without zone->lock in 
isolate_single_pageblock().
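
If I'm reading the free path right, __free_one_page() bumps the
NR_FREE_PAGES/NR_FREE_CMA_PAGES counters for every piece it puts back,
while del_page_from_free_list() drops nothing, so each split could
inflate the counters by the size of the original free page. Something
along these lines looks to be missing from split_free_page(). This is
only an untested sketch based on my reading of the existing accounting
helpers (__mod_zone_freepage_state() and friends), not a tested patch:

	int mt = get_pageblock_migratetype(free_page);

	spin_lock_irqsave(&zone->lock, flags);
	/*
	 * Drop the whole free page from the per-migratetype counters
	 * before deleting it; __free_one_page() below re-adds each
	 * split piece under the migratetype of its own pageblock.
	 */
	if (likely(!is_migrate_isolate(mt)))
		__mod_zone_freepage_state(zone, -(1UL << order), mt);
	del_page_from_free_list(free_page, zone, order);
	/* ... existing loop calling __free_one_page() stays as is ... */
	spin_unlock_irqrestore(&zone->lock, flags);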

Please investigate this as well.

Thanks!
     Doug

On 4/29/2022 6:54 AM, Zi Yan wrote:
> On 25 Apr 2022, at 10:31, Zi Yan wrote:
> 
>> From: Zi Yan <ziy@nvidia.com>
>>
>> alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
>> merging pageblocks with different migratetypes. It might unnecessarily
>> convert extra pageblocks at the beginning and at the end of the range.
>> Change alloc_contig_range() to work at pageblock granularity.
>>
>> Special handling is needed for free pages and in-use pages across the
>> boundaries of the range specified by alloc_contig_range(), because these
>> partially isolated pages cause free page accounting issues. The free
>> pages will be split and freed into separate migratetype lists; the
>> in-use pages will be migrated then the freed pages will be handled in
>> the aforementioned way.
>>
>> Reported-by: kernel test robot <lkp@intel.com>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   include/linux/page-isolation.h |   4 +-
>>   mm/internal.h                  |   6 ++
>>   mm/memory_hotplug.c            |   3 +-
>>   mm/page_alloc.c                |  54 ++++++++--
>>   mm/page_isolation.c            | 184 ++++++++++++++++++++++++++++++++-
>>   5 files changed, 233 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index e14eddf6741a..5456b7be38ae 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
>>    */
>>   int
>>   start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> -			 unsigned migratetype, int flags);
>> +			 int migratetype, int flags, gfp_t gfp_flags);
>>
>>   /*
>>    * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
>> @@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>    */
>>   void
>>   undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> -			unsigned migratetype);
>> +			int migratetype);
>>
>>   /*
>>    * Test all pages in [start_pfn, end_pfn) are isolated or not.
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 919fa07e1031..0667abd57634 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>>   			  phys_addr_t min_addr,
>>   			  int nid, bool exact_nid);
>>
>> +void split_free_page(struct page *free_page,
>> +				int order, unsigned long split_pfn_offset);
>> +
>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>
>>   /*
>> @@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
>>   int
>>   isolate_migratepages_range(struct compact_control *cc,
>>   			   unsigned long low_pfn, unsigned long end_pfn);
>> +
>> +int __alloc_contig_migrate_range(struct compact_control *cc,
>> +					unsigned long start, unsigned long end);
>>   #endif
>>   int find_suitable_fallback(struct free_area *area, unsigned int order,
>>   			int migratetype, bool only_stealable, bool *can_steal);
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 4c6065e5d274..9f8ae4cb77ee 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1845,7 +1845,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>>   	/* set above range as isolated */
>>   	ret = start_isolate_page_range(start_pfn, end_pfn,
>>   				       MIGRATE_MOVABLE,
>> -				       MEMORY_OFFLINE | REPORT_FAILURE);
>> +				       MEMORY_OFFLINE | REPORT_FAILURE,
>> +				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
>>   	if (ret) {
>>   		reason = "failure to isolate range";
>>   		goto failed_removal_pcplists_disabled;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ce23ac8ad085..70ddd9a0bcf3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1094,6 +1094,43 @@ static inline void __free_one_page(struct page *page,
>>   		page_reporting_notify_free(order);
>>   }
>>
>> +/**
>> + * split_free_page() -- split a free page at split_pfn_offset
>> + * @free_page:		the original free page
>> + * @order:		the order of the page
>> + * @split_pfn_offset:	split offset within the page
>> + *
>> + * It is used when the free page crosses two pageblocks with different migratetypes
>> + * at split_pfn_offset within the page. The split free page will be put into
>> + * separate migratetype lists afterwards. Otherwise, the function achieves
>> + * nothing.
>> + */
>> +void split_free_page(struct page *free_page,
>> +				int order, unsigned long split_pfn_offset)
>> +{
>> +	struct zone *zone = page_zone(free_page);
>> +	unsigned long free_page_pfn = page_to_pfn(free_page);
>> +	unsigned long pfn;
>> +	unsigned long flags;
>> +	int free_page_order;
>> +
>> +	spin_lock_irqsave(&zone->lock, flags);
>> +	del_page_from_free_list(free_page, zone, order);
>> +	for (pfn = free_page_pfn;
>> +	     pfn < free_page_pfn + (1UL << order);) {
>> +		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
>> +
>> +		free_page_order = ffs(split_pfn_offset) - 1;
>> +		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
>> +				mt, FPI_NONE);
>> +		pfn += 1UL << free_page_order;
>> +		split_pfn_offset -= (1UL << free_page_order);
>> +		/* we have done the first part, now switch to second part */
>> +		if (split_pfn_offset == 0)
>> +			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>> +	}
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>> +}
>>   /*
>>    * A bad page could be due to a number of fields. Instead of multiple branches,
>>    * try and check multiple fields with one check. The caller must do a detailed
>> @@ -8919,7 +8956,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
>>   #endif
>>
>>   /* [start, end) must belong to a single zone. */
>> -static int __alloc_contig_migrate_range(struct compact_control *cc,
>> +int __alloc_contig_migrate_range(struct compact_control *cc,
>>   					unsigned long start, unsigned long end)
>>   {
>>   	/* This function is based on compact_zone() from compaction.c. */
>> @@ -9002,7 +9039,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>   		       unsigned migratetype, gfp_t gfp_mask)
>>   {
>>   	unsigned long outer_start, outer_end;
>> -	unsigned int order;
>> +	int order;
>>   	int ret = 0;
>>
>>   	struct compact_control cc = {
>> @@ -9021,14 +9058,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>   	 * What we do here is we mark all pageblocks in range as
>>   	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
>>   	 * have different sizes, and due to the way page allocator
>> -	 * work, we align the range to biggest of the two pages so
>> -	 * that page allocator won't try to merge buddies from
>> -	 * different pageblocks and change MIGRATE_ISOLATE to some
>> -	 * other migration type.
>> +	 * work, start_isolate_page_range() has special handlings for this.
>>   	 *
>>   	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
>>   	 * migrate the pages from an unaligned range (ie. pages that
>> -	 * we are interested in).  This will put all the pages in
>> +	 * we are interested in). This will put all the pages in
>>   	 * range back to page allocator as MIGRATE_ISOLATE.
>>   	 *
>>   	 * When this is done, we take the pages in range from page
>> @@ -9042,9 +9076,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>   	 */
>>
>>   	ret = start_isolate_page_range(pfn_max_align_down(start),
>> -				       pfn_max_align_up(end), migratetype, 0);
>> +				pfn_max_align_up(end), migratetype, 0, gfp_mask);
>>   	if (ret)
>> -		return ret;
>> +		goto done;
>>
>>   	drain_all_pages(cc.zone);
>>
>> @@ -9064,7 +9098,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>   	ret = 0;
>>
>>   	/*
>> -	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
>> +	 * Pages from [start, end) are within a pageblock_nr_pages
>>   	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
>>   	 * more, all pages in [start, end) are free in page allocator.
>>   	 * What we are going to do is to allocate all pages from
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index c2f7a8bb634d..94b3467e5ba2 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>>   	return -EBUSY;
>>   }
>>
>> -static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>> +static void unset_migratetype_isolate(struct page *page, int migratetype)
>>   {
>>   	struct zone *zone;
>>   	unsigned long flags, nr_pages;
>> @@ -279,6 +279,157 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>   	return NULL;
>>   }
>>
>> +/**
>> + * isolate_single_pageblock() -- tries to isolate a pageblock that might be
>> + * within a free or in-use page.
>> + * @boundary_pfn:		pageblock-aligned pfn that a page might cross
>> + * @gfp_flags:			GFP flags used for migrating pages
>> + * @isolate_before:	isolate the pageblock before the boundary_pfn
>> + *
>> + * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>> + * pageblock. When not all pageblocks within a page are isolated at the same
>> + * time, free page accounting can go wrong. For example, in the case of
>> + * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pageblocks.
>> + * [         MAX_ORDER-1         ]
>> + * [  pageblock0  |  pageblock1  ]
>> + * When either pageblock is isolated, if it is a free page, the page is not
>> + * split into separate migratetype lists, which is supposed to; if it is an
>> + * in-use page and freed later, __free_one_page() does not split the free page
>> + * either. The function handles this by splitting the free page or migrating
>> + * the in-use page then splitting the free page.
>> + */
>> +static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>> +			bool isolate_before)
>> +{
>> +	unsigned char saved_mt;
>> +	unsigned long start_pfn;
>> +	unsigned long isolate_pageblock;
>> +	unsigned long pfn;
>> +	struct zone *zone;
>> +
>> +	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
>> +
>> +	if (isolate_before)
>> +		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
>> +	else
>> +		isolate_pageblock = boundary_pfn;
>> +
>> +	/*
>> +	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
>> +	 * only isolating a subset of pageblocks from a bigger than pageblock
>> +	 * free or in-use page. Also make sure all to-be-isolated pageblocks
>> +	 * are within the same zone.
>> +	 */
>> +	zone  = page_zone(pfn_to_page(isolate_pageblock));
>> +	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>> +				      zone->zone_start_pfn);
>> +
>> +	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
>> +
>> +	/*
>> +	 * Bail out early when the to-be-isolated pageblock does not form
>> +	 * a free or in-use page across boundary_pfn:
>> +	 *
>> +	 * 1. isolate before boundary_pfn: the page after is not online
>> +	 * 2. isolate after boundary_pfn: the page before is not online
>> +	 *
>> +	 * This also ensures correctness. Without it, when isolate after
>> +	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
>> +	 * __first_valid_page() will return unexpected NULL in the for loop
>> +	 * below.
>> +	 */
>> +	if (isolate_before) {
>> +		if (!pfn_to_online_page(boundary_pfn))
>> +			return 0;
>> +	} else {
>> +		if (!pfn_to_online_page(boundary_pfn - 1))
>> +			return 0;
>> +	}
>> +
>> +	for (pfn = start_pfn; pfn < boundary_pfn;) {
>> +		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
>> +
>> +		VM_BUG_ON(!page);
>> +		pfn = page_to_pfn(page);
>> +		/*
>> +		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
>> +		 * free pages in [start_pfn, boundary_pfn), its head page will
>> +		 * always be in the range.
>> +		 */
>> +		if (PageBuddy(page)) {
>> +			int order = buddy_order(page);
>> +
>> +			if (pfn + (1UL << order) > boundary_pfn)
>> +				split_free_page(page, order, boundary_pfn - pfn);
>> +			pfn += (1UL << order);
>> +			continue;
>> +		}
>> +		/*
>> +		 * migrate compound pages then let the free page handling code
>> +		 * above do the rest. If migration is not enabled, just fail.
>> +		 */
>> +		if (PageHuge(page) || PageTransCompound(page)) {
>> +#if defined CONFIG_COMPACTION || defined CONFIG_CMA
>> +			unsigned long nr_pages = compound_nr(page);
>> +			int order = compound_order(page);
>> +			struct page *head = compound_head(page);
>> +			unsigned long head_pfn = page_to_pfn(head);
>> +			int ret;
>> +			struct compact_control cc = {
>> +				.nr_migratepages = 0,
>> +				.order = -1,
>> +				.zone = page_zone(pfn_to_page(head_pfn)),
>> +				.mode = MIGRATE_SYNC,
>> +				.ignore_skip_hint = true,
>> +				.no_set_skip_hint = true,
>> +				.gfp_mask = gfp_flags,
>> +				.alloc_contig = true,
>> +			};
>> +			INIT_LIST_HEAD(&cc.migratepages);
>> +
>> +			if (head_pfn + nr_pages < boundary_pfn) {
>> +				pfn += nr_pages;
>> +				continue;
>> +			}
>> +
>> +			ret = __alloc_contig_migrate_range(&cc, head_pfn,
>> +						head_pfn + nr_pages);
>> +
>> +			if (ret)
>> +				goto failed;
>> +			/*
>> +			 * reset pfn, let the free page handling code above
>> +			 * split the free page to the right migratetype list.
>> +			 *
>> +			 * head_pfn is not used here as a hugetlb page order
>> +			 * can be bigger than MAX_ORDER-1, but after it is
>> +			 * freed, the free page order is not. Use pfn within
>> +			 * the range to find the head of the free page and
>> +			 * reset order to 0 if a hugetlb page with
>> +			 * >MAX_ORDER-1 order is encountered.
>> +			 */
>> +			if (order > MAX_ORDER-1)
>> +				order = 0;
>> +			while (!PageBuddy(pfn_to_page(pfn))) {
>> +				order++;
>> +				pfn &= ~0UL << order;
>> +			}
>> +			continue;
>> +#else
>> +			goto failed;
>> +#endif
>> +		}
>> +
>> +		pfn++;
>> +	}
>> +	return 0;
>> +failed:
>> +	/* restore the original migratetype */
>> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
>> +	return -EBUSY;
>> +}
>> +
>>   /**
>>    * start_isolate_page_range() - make page-allocation-type of range of pages to
>>    * be MIGRATE_ISOLATE.
>> @@ -293,6 +444,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>    *					 and PageOffline() pages.
>>    *			REPORT_FAILURE - report details about the failure to
>>    *			isolate the range
>> + * @gfp_flags:		GFP flags used for migrating pages that sit across the
>> + *			range boundaries.
>>    *
>>    * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
>>    * the range will never be allocated. Any free pages and pages freed in the
>> @@ -301,6 +454,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>    * pages in the range finally, the caller have to free all pages in the range.
>>    * test_page_isolated() can be used for test it.
>>    *
>> + * The function first tries to isolate the pageblocks at the beginning and end
>> + * of the range, since there might be pages across the range boundaries.
>> + * Afterwards, it isolates the rest of the range.
>> + *
>>    * There is no high level synchronization mechanism that prevents two threads
>>    * from trying to isolate overlapping ranges. If this happens, one thread
>>    * will notice pageblocks in the overlapping range already set to isolate.
>> @@ -321,21 +478,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>    * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>>    */
>>   int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> -			     unsigned migratetype, int flags)
>> +			     int migratetype, int flags, gfp_t gfp_flags)
>>   {
>>   	unsigned long pfn;
>>   	struct page *page;
>> +	int ret;
>>
>>   	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
>>   	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
>>
>> -	for (pfn = start_pfn;
>> -	     pfn < end_pfn;
>> +	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
>> +	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
>> +	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
>> +	if (ret) {
>> +		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
>> +		return ret;
>> +	}
>> +
>> +	/* skip isolated pageblocks at the beginning and end */
>> +	for (pfn = start_pfn + pageblock_nr_pages;
>> +	     pfn < end_pfn - pageblock_nr_pages;
>>   	     pfn += pageblock_nr_pages) {
>>   		page = __first_valid_page(pfn, pageblock_nr_pages);
>>   		if (page && set_migratetype_isolate(page, migratetype, flags,
>>   					start_pfn, end_pfn)) {
>>   			undo_isolate_page_range(start_pfn, pfn, migratetype);
>> +			unset_migratetype_isolate(
>> +				pfn_to_page(end_pfn - pageblock_nr_pages),
>> +				migratetype);
>>   			return -EBUSY;
>>   		}
>>   	}
>> @@ -346,7 +520,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>    * Make isolated pages available again.
>>    */
>>   void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> -			    unsigned migratetype)
>> +			    int migratetype)
>>   {
>>   	unsigned long pfn;
>>   	struct page *page;
>> -- 
>> 2.35.1
> 
> Qian hit a bug caused by this series https://lore.kernel.org/linux-mm/20220426201855.GA1014@qian/
> and the fix is:
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 75e454f5cf45..b3f074d1682e 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -367,58 +367,67 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>   		}
>   		/*
>   		 * migrate compound pages then let the free page handling code
> -		 * above do the rest. If migration is not enabled, just fail.
> +		 * above do the rest. If migration is not possible, just fail.
>   		 */
> -		if (PageHuge(page) || PageTransCompound(page)) {
> -#if defined CONFIG_COMPACTION || defined CONFIG_CMA
> +		if (PageCompound(page)) {
>   			unsigned long nr_pages = compound_nr(page);
> -			int order = compound_order(page);
>   			struct page *head = compound_head(page);
>   			unsigned long head_pfn = page_to_pfn(head);
> -			int ret;
> -			struct compact_control cc = {
> -				.nr_migratepages = 0,
> -				.order = -1,
> -				.zone = page_zone(pfn_to_page(head_pfn)),
> -				.mode = MIGRATE_SYNC,
> -				.ignore_skip_hint = true,
> -				.no_set_skip_hint = true,
> -				.gfp_mask = gfp_flags,
> -				.alloc_contig = true,
> -			};
> -			INIT_LIST_HEAD(&cc.migratepages);
> 
>   			if (head_pfn + nr_pages < boundary_pfn) {
> -				pfn += nr_pages;
> +				pfn = head_pfn + nr_pages;
>   				continue;
>   			}
> -
> -			ret = __alloc_contig_migrate_range(&cc, head_pfn,
> -						head_pfn + nr_pages);
> -
> -			if (ret)
> -				goto failed;
> +#if defined CONFIG_COMPACTION || defined CONFIG_CMA
>   			/*
> -			 * reset pfn, let the free page handling code above
> -			 * split the free page to the right migratetype list.
> -			 *
> -			 * head_pfn is not used here as a hugetlb page order
> -			 * can be bigger than MAX_ORDER-1, but after it is
> -			 * freed, the free page order is not. Use pfn within
> -			 * the range to find the head of the free page and
> -			 * reset order to 0 if a hugetlb page with
> -			 * >MAX_ORDER-1 order is encountered.
> +			 * hugetlb, lru compound (THP), and movable compound pages
> +			 * can be migrated. Otherwise, fail the isolation.
>   			 */
> -			if (order > MAX_ORDER-1)
> +			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
> +				int order;
> +				unsigned long outer_pfn;
> +				int ret;
> +				struct compact_control cc = {
> +					.nr_migratepages = 0,
> +					.order = -1,
> +					.zone = page_zone(pfn_to_page(head_pfn)),
> +					.mode = MIGRATE_SYNC,
> +					.ignore_skip_hint = true,
> +					.no_set_skip_hint = true,
> +					.gfp_mask = gfp_flags,
> +					.alloc_contig = true,
> +				};
> +				INIT_LIST_HEAD(&cc.migratepages);
> +
> +				ret = __alloc_contig_migrate_range(&cc, head_pfn,
> +							head_pfn + nr_pages);
> +
> +				if (ret)
> +					goto failed;
> +				/*
> +				 * reset pfn to the head of the free page, so
> +				 * that the free page handling code above can split
> +				 * the free page to the right migratetype list.
> +				 *
> +				 * head_pfn is not used here as a hugetlb page order
> +				 * can be bigger than MAX_ORDER-1, but after it is
> +				 * freed, the free page order is not. Use pfn within
> +				 * the range to find the head of the free page.
> +				 */
>   				order = 0;
> -			while (!PageBuddy(pfn_to_page(pfn))) {
> -				order++;
> -				pfn &= ~0UL << order;
> -			}
> -			continue;
> -#else
> -			goto failed;
> +				outer_pfn = pfn;
> +				while (!PageBuddy(pfn_to_page(outer_pfn))) {
> +					if (++order >= MAX_ORDER) {
> +						outer_pfn = pfn;
> +						break;
> +					}
> +					outer_pfn &= ~0UL << order;
> +				}
> +				pfn = outer_pfn;
> +				continue;
> +			} else
>   #endif
> +				goto failed;
>   		}
> 
>   		pfn++;


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-05-25 17:41     ` Doug Berger
@ 2022-05-25 17:53       ` Zi Yan
  2022-05-25 21:03         ` Doug Berger
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-05-25 17:53 UTC (permalink / raw)
  To: Doug Berger
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton,
	kernel test robot, Qian Cai

[-- Attachment #1: Type: text/plain, Size: 26078 bytes --]

On 25 May 2022, at 13:41, Doug Berger wrote:

> I am seeing some free memory accounting problems with linux-next that I have bisected to this commit (i.e. b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")).
>
> On an arm64 SMP platform with 4GB total memory and the default 16MB CMA pool, I am seeing the following after boot with a sysrq Show Memory (e.g. 'echo m > /proc/sysrq-trigger'):
>
> [   16.015906] sysrq: Show Memory
> [   16.019039] Mem-Info:
> [   16.021348] active_anon:14604 inactive_anon:919 isolated_anon:0
> [   16.021348]  active_file:0 inactive_file:0 isolated_file:0
> [   16.021348]  unevictable:0 dirty:0 writeback:0
> [   16.021348]  slab_reclaimable:3662 slab_unreclaimable:3333
> [   16.021348]  mapped:928 shmem:15146 pagetables:63 bounce:0
> [   16.021348]  kernel_misc_reclaimable:0
> [   16.021348]  free:976766 free_pcp:991 free_cma:7017
> [   16.056937] Node 0 active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3712kB dirty:0kB writeback:0kB shmem:60584kB writeback_tmp:0kB kernel_stack:1200kB pagetables:252kB all_unreclaimable? no
> [   16.081526] DMA free:3041036kB boost:0kB min:6036kB low:9044kB high:12052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029992kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:28068kB
> [   16.108650] lowmem_reserve[]: 0 0 944 944
> [   16.112746] Normal free:866028kB boost:0kB min:1936kB low:2900kB high:3864kB reserved_highatomic:0KB active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3328kB local_pcp:864kB free_cma:0kB
> [   16.140393] lowmem_reserve[]: 0 0 0 0
> [   16.144133] DMA: 7*4kB (UMC) 4*8kB (M) 3*16kB (M) 3*32kB (MC) 5*64kB (M) 4*128kB (MC) 5*256kB (UMC) 7*512kB (UM) 5*1024kB (UM) 9*2048kB (UMC) 732*4096kB (MC) = 3027724kB
> [   16.159609] Normal: 149*4kB (UM) 95*8kB (UME) 26*16kB (UME) 8*32kB (ME) 2*64kB (UE) 1*128kB (M) 2*256kB (ME) 2*512kB (ME) 2*1024kB (UM) 0*2048kB 210*4096kB (M) = 866028kB
> [   16.175165] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [   16.183937] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
> [   16.192533] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [   16.201040] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
> [   16.209374] 15146 total pagecache pages
> [   16.213246] 0 pages in swap cache
> [   16.216595] Swap cache stats: add 0, delete 0, find 0/0
> [   16.221867] Free swap  = 0kB
> [   16.224780] Total swap = 0kB
> [   16.227693] 1048576 pages RAM
> [   16.230694] 0 pages HighMem/MovableOnly
> [   16.234564] 49240 pages reserved
> [   16.237825] 4096 pages cma reserved
>
> Some anomalies in the above are:
> free_cma:7017 with only 4096 pages cma reserved
> DMA free:3041036kB with only managed:3029992kB
>
> I'm not sure what is going on here, but I am suspicious of split_free_page(), since del_page_from_free_list() doesn't affect the migratetype freepage accounting, but __free_one_page() can.
> Also PageBuddy(page) is being checked without zone->lock in isolate_single_pageblock().
>
> Please investigate this as well.


Can you try this patch https://lore.kernel.org/linux-mm/20220524194756.1698351-1-zi.yan@sent.com/
and see if it fixes the issue?

Thanks.

>
> Thanks!
>     Doug
>
> On 4/29/2022 6:54 AM, Zi Yan wrote:
>> On 25 Apr 2022, at 10:31, Zi Yan wrote:
>>
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
>>> merging pageblocks with different migratetypes. It might unnecessarily
>>> convert extra pageblocks at the beginning and at the end of the range.
>>> Change alloc_contig_range() to work at pageblock granularity.
>>>
>>> Special handling is needed for free pages and in-use pages across the
>>> boundaries of the range specified by alloc_contig_range(), because these
>>> partially isolated pages cause free page accounting issues. The free
>>> pages will be split and freed into separate migratetype lists; the
>>> in-use pages will be migrated then the freed pages will be handled in
>>> the aforementioned way.
>>>
>>> Reported-by: kernel test robot <lkp@intel.com>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>   include/linux/page-isolation.h |   4 +-
>>>   mm/internal.h                  |   6 ++
>>>   mm/memory_hotplug.c            |   3 +-
>>>   mm/page_alloc.c                |  54 ++++++++--
>>>   mm/page_isolation.c            | 184 ++++++++++++++++++++++++++++++++-
>>>   5 files changed, 233 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>>> index e14eddf6741a..5456b7be38ae 100644
>>> --- a/include/linux/page-isolation.h
>>> +++ b/include/linux/page-isolation.h
>>> @@ -42,7 +42,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
>>>    */
>>>   int
>>>   start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> -			 unsigned migratetype, int flags);
>>> +			 int migratetype, int flags, gfp_t gfp_flags);
>>>
>>>   /*
>>>    * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
>>> @@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>    */
>>>   void
>>>   undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> -			unsigned migratetype);
>>> +			int migratetype);
>>>
>>>   /*
>>>    * Test all pages in [start_pfn, end_pfn) are isolated or not.
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 919fa07e1031..0667abd57634 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -359,6 +359,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>>>   			  phys_addr_t min_addr,
>>>   			  int nid, bool exact_nid);
>>>
>>> +void split_free_page(struct page *free_page,
>>> +				int order, unsigned long split_pfn_offset);
>>> +
>>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>>
>>>   /*
>>> @@ -422,6 +425,9 @@ isolate_freepages_range(struct compact_control *cc,
>>>   int
>>>   isolate_migratepages_range(struct compact_control *cc,
>>>   			   unsigned long low_pfn, unsigned long end_pfn);
>>> +
>>> +int __alloc_contig_migrate_range(struct compact_control *cc,
>>> +					unsigned long start, unsigned long end);
>>>   #endif
>>>   int find_suitable_fallback(struct free_area *area, unsigned int order,
>>>   			int migratetype, bool only_stealable, bool *can_steal);
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 4c6065e5d274..9f8ae4cb77ee 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1845,7 +1845,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>>>   	/* set above range as isolated */
>>>   	ret = start_isolate_page_range(start_pfn, end_pfn,
>>>   				       MIGRATE_MOVABLE,
>>> -				       MEMORY_OFFLINE | REPORT_FAILURE);
>>> +				       MEMORY_OFFLINE | REPORT_FAILURE,
>>> +				       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
>>>   	if (ret) {
>>>   		reason = "failure to isolate range";
>>>   		goto failed_removal_pcplists_disabled;
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index ce23ac8ad085..70ddd9a0bcf3 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1094,6 +1094,43 @@ static inline void __free_one_page(struct page *page,
>>>   		page_reporting_notify_free(order);
>>>   }
>>>
>>> +/**
>>> + * split_free_page() -- split a free page at split_pfn_offset
>>> + * @free_page:		the original free page
>>> + * @order:		the order of the page
>>> + * @split_pfn_offset:	split offset within the page
>>> + *
>>> + * It is used when the free page crosses two pageblocks with different migratetypes
>>> + * at split_pfn_offset within the page. The split free page will be put into
>>> + * separate migratetype lists afterwards. Otherwise, the function achieves
>>> + * nothing.
>>> + */
>>> +void split_free_page(struct page *free_page,
>>> +				int order, unsigned long split_pfn_offset)
>>> +{
>>> +	struct zone *zone = page_zone(free_page);
>>> +	unsigned long free_page_pfn = page_to_pfn(free_page);
>>> +	unsigned long pfn;
>>> +	unsigned long flags;
>>> +	int free_page_order;
>>> +
>>> +	spin_lock_irqsave(&zone->lock, flags);
>>> +	del_page_from_free_list(free_page, zone, order);
>>> +	for (pfn = free_page_pfn;
>>> +	     pfn < free_page_pfn + (1UL << order);) {
>>> +		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
>>> +
>>> +		free_page_order = ffs(split_pfn_offset) - 1;
>>> +		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
>>> +				mt, FPI_NONE);
>>> +		pfn += 1UL << free_page_order;
>>> +		split_pfn_offset -= (1UL << free_page_order);
>>> +		/* we have done the first part, now switch to second part */
>>> +		if (split_pfn_offset == 0)
>>> +			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>>> +	}
>>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>> +}
>>>   /*
>>>    * A bad page could be due to a number of fields. Instead of multiple branches,
>>>    * try and check multiple fields with one check. The caller must do a detailed
>>> @@ -8919,7 +8956,7 @@ static inline void alloc_contig_dump_pages(struct list_head *page_list)
>>>   #endif
>>>
>>>   /* [start, end) must belong to a single zone. */
>>> -static int __alloc_contig_migrate_range(struct compact_control *cc,
>>> +int __alloc_contig_migrate_range(struct compact_control *cc,
>>>   					unsigned long start, unsigned long end)
>>>   {
>>>   	/* This function is based on compact_zone() from compaction.c. */
>>> @@ -9002,7 +9039,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>>   		       unsigned migratetype, gfp_t gfp_mask)
>>>   {
>>>   	unsigned long outer_start, outer_end;
>>> -	unsigned int order;
>>> +	int order;
>>>   	int ret = 0;
>>>
>>>   	struct compact_control cc = {
>>> @@ -9021,14 +9058,11 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>>   	 * What we do here is we mark all pageblocks in range as
>>>   	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
>>>   	 * have different sizes, and due to the way page allocator
>>> -	 * work, we align the range to biggest of the two pages so
>>> -	 * that page allocator won't try to merge buddies from
>>> -	 * different pageblocks and change MIGRATE_ISOLATE to some
>>> -	 * other migration type.
>>> +	 * work, start_isolate_page_range() has special handlings for this.
>>>   	 *
>>>   	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
>>>   	 * migrate the pages from an unaligned range (ie. pages that
>>> -	 * we are interested in).  This will put all the pages in
>>> +	 * we are interested in). This will put all the pages in
>>>   	 * range back to page allocator as MIGRATE_ISOLATE.
>>>   	 *
>>>   	 * When this is done, we take the pages in range from page
>>> @@ -9042,9 +9076,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>>   	 */
>>>
>>>   	ret = start_isolate_page_range(pfn_max_align_down(start),
>>> -				       pfn_max_align_up(end), migratetype, 0);
>>> +				pfn_max_align_up(end), migratetype, 0, gfp_mask);
>>>   	if (ret)
>>> -		return ret;
>>> +		goto done;
>>>
>>>   	drain_all_pages(cc.zone);
>>>
>>> @@ -9064,7 +9098,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>>   	ret = 0;
>>>
>>>   	/*
>>> -	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
>>> +	 * Pages from [start, end) are within a pageblock_nr_pages
>>>   	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
>>>   	 * more, all pages in [start, end) are free in page allocator.
>>>   	 * What we are going to do is to allocate all pages from
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index c2f7a8bb634d..94b3467e5ba2 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -203,7 +203,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>>>   	return -EBUSY;
>>>   }
>>>
>>> -static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>>> +static void unset_migratetype_isolate(struct page *page, int migratetype)
>>>   {
>>>   	struct zone *zone;
>>>   	unsigned long flags, nr_pages;
>>> @@ -279,6 +279,157 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>   	return NULL;
>>>   }
>>>
>>> +/**
>>> + * isolate_single_pageblock() -- tries to isolate a pageblock that might be
>>> + * within a free or in-use page.
>>> + * @boundary_pfn:		pageblock-aligned pfn that a page might cross
>>> + * @gfp_flags:			GFP flags used for migrating pages
>>> + * @isolate_before:	isolate the pageblock before the boundary_pfn
>>> + *
>>> + * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>>> + * pageblock. When not all pageblocks within a page are isolated at the same
>>> + * time, free page accounting can go wrong. For example, in the case of
>>> + * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pageblocks.
>>> + * [         MAX_ORDER-1         ]
>>> + * [  pageblock0  |  pageblock1  ]
>>> + * When either pageblock is isolated, if it is a free page, the page is not
>>> + * split into separate migratetype lists, which is supposed to; if it is an
>>> + * in-use page and freed later, __free_one_page() does not split the free page
>>> + * either. The function handles this by splitting the free page or migrating
>>> + * the in-use page then splitting the free page.
>>> + */
>>> +static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>> +			bool isolate_before)
>>> +{
>>> +	unsigned char saved_mt;
>>> +	unsigned long start_pfn;
>>> +	unsigned long isolate_pageblock;
>>> +	unsigned long pfn;
>>> +	struct zone *zone;
>>> +
>>> +	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
>>> +
>>> +	if (isolate_before)
>>> +		isolate_pageblock = boundary_pfn - pageblock_nr_pages;
>>> +	else
>>> +		isolate_pageblock = boundary_pfn;
>>> +
>>> +	/*
>>> +	 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
>>> +	 * only isolating a subset of pageblocks from a bigger than pageblock
>>> +	 * free or in-use page. Also make sure all to-be-isolated pageblocks
>>> +	 * are within the same zone.
>>> +	 */
>>> +	zone  = page_zone(pfn_to_page(isolate_pageblock));
>>> +	start_pfn  = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>> +				      zone->zone_start_pfn);
>>> +
>>> +	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
>>> +
>>> +	/*
>>> +	 * Bail out early when the to-be-isolated pageblock does not form
>>> +	 * a free or in-use page across boundary_pfn:
>>> +	 *
>>> +	 * 1. isolate before boundary_pfn: the page after is not online
>>> +	 * 2. isolate after boundary_pfn: the page before is not online
>>> +	 *
>>> +	 * This also ensures correctness. Without it, when isolate after
>>> +	 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
>>> +	 * __first_valid_page() will return unexpected NULL in the for loop
>>> +	 * below.
>>> +	 */
>>> +	if (isolate_before) {
>>> +		if (!pfn_to_online_page(boundary_pfn))
>>> +			return 0;
>>> +	} else {
>>> +		if (!pfn_to_online_page(boundary_pfn - 1))
>>> +			return 0;
>>> +	}
>>> +
>>> +	for (pfn = start_pfn; pfn < boundary_pfn;) {
>>> +		struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
>>> +
>>> +		VM_BUG_ON(!page);
>>> +		pfn = page_to_pfn(page);
>>> +		/*
>>> +		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
>>> +		 * free pages in [start_pfn, boundary_pfn), its head page will
>>> +		 * always be in the range.
>>> +		 */
>>> +		if (PageBuddy(page)) {
>>> +			int order = buddy_order(page);
>>> +
>>> +			if (pfn + (1UL << order) > boundary_pfn)
>>> +				split_free_page(page, order, boundary_pfn - pfn);
>>> +			pfn += (1UL << order);
>>> +			continue;
>>> +		}
>>> +		/*
>>> +		 * migrate compound pages then let the free page handling code
>>> +		 * above do the rest. If migration is not enabled, just fail.
>>> +		 */
>>> +		if (PageHuge(page) || PageTransCompound(page)) {
>>> +#if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>> +			unsigned long nr_pages = compound_nr(page);
>>> +			int order = compound_order(page);
>>> +			struct page *head = compound_head(page);
>>> +			unsigned long head_pfn = page_to_pfn(head);
>>> +			int ret;
>>> +			struct compact_control cc = {
>>> +				.nr_migratepages = 0,
>>> +				.order = -1,
>>> +				.zone = page_zone(pfn_to_page(head_pfn)),
>>> +				.mode = MIGRATE_SYNC,
>>> +				.ignore_skip_hint = true,
>>> +				.no_set_skip_hint = true,
>>> +				.gfp_mask = gfp_flags,
>>> +				.alloc_contig = true,
>>> +			};
>>> +			INIT_LIST_HEAD(&cc.migratepages);
>>> +
>>> +			if (head_pfn + nr_pages < boundary_pfn) {
>>> +				pfn += nr_pages;
>>> +				continue;
>>> +			}
>>> +
>>> +			ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>> +						head_pfn + nr_pages);
>>> +
>>> +			if (ret)
>>> +				goto failed;
>>> +			/*
>>> +			 * reset pfn, let the free page handling code above
>>> +			 * split the free page to the right migratetype list.
>>> +			 *
>>> +			 * head_pfn is not used here as a hugetlb page order
>>> +			 * can be bigger than MAX_ORDER-1, but after it is
>>> +			 * freed, the free page order is not. Use pfn within
>>> +			 * the range to find the head of the free page and
>>> +			 * reset order to 0 if a hugetlb page with
>>> +			 * >MAX_ORDER-1 order is encountered.
>>> +			 */
>>> +			if (order > MAX_ORDER-1)
>>> +				order = 0;
>>> +			while (!PageBuddy(pfn_to_page(pfn))) {
>>> +				order++;
>>> +				pfn &= ~0UL << order;
>>> +			}
>>> +			continue;
>>> +#else
>>> +			goto failed;
>>> +#endif
>>> +		}
>>> +
>>> +		pfn++;
>>> +	}
>>> +	return 0;
>>> +failed:
>>> +	/* restore the original migratetype */
>>> +	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
>>> +	return -EBUSY;
>>> +}
>>> +
>>>   /**
>>>    * start_isolate_page_range() - make page-allocation-type of range of pages to
>>>    * be MIGRATE_ISOLATE.
>>> @@ -293,6 +444,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>    *					 and PageOffline() pages.
>>>    *			REPORT_FAILURE - report details about the failure to
>>>    *			isolate the range
>>> + * @gfp_flags:		GFP flags used for migrating pages that sit across the
>>> + *			range boundaries.
>>>    *
>>>    * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
>>>    * the range will never be allocated. Any free pages and pages freed in the
>>> @@ -301,6 +454,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>    * pages in the range finally, the caller have to free all pages in the range.
>>>    * test_page_isolated() can be used for test it.
>>>    *
>>> + * The function first tries to isolate the pageblocks at the beginning and end
>>> + * of the range, since there might be pages across the range boundaries.
>>> + * Afterwards, it isolates the rest of the range.
>>> + *
>>>    * There is no high level synchronization mechanism that prevents two threads
>>>    * from trying to isolate overlapping ranges. If this happens, one thread
>>>    * will notice pageblocks in the overlapping range already set to isolate.
>>> @@ -321,21 +478,38 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>    * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>>>    */
>>>   int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> -			     unsigned migratetype, int flags)
>>> +			     int migratetype, int flags, gfp_t gfp_flags)
>>>   {
>>>   	unsigned long pfn;
>>>   	struct page *page;
>>> +	int ret;
>>>
>>>   	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
>>>   	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
>>>
>>> -	for (pfn = start_pfn;
>>> -	     pfn < end_pfn;
>>> +	/* isolate [start_pfn, start_pfn + pageblock_nr_pages) pageblock */
>>> +	ret = isolate_single_pageblock(start_pfn, gfp_flags, false);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	/* isolate [end_pfn - pageblock_nr_pages, end_pfn) pageblock */
>>> +	ret = isolate_single_pageblock(end_pfn, gfp_flags, true);
>>> +	if (ret) {
>>> +		unset_migratetype_isolate(pfn_to_page(start_pfn), migratetype);
>>> +		return ret;
>>> +	}
>>> +
>>> +	/* skip isolated pageblocks at the beginning and end */
>>> +	for (pfn = start_pfn + pageblock_nr_pages;
>>> +	     pfn < end_pfn - pageblock_nr_pages;
>>>   	     pfn += pageblock_nr_pages) {
>>>   		page = __first_valid_page(pfn, pageblock_nr_pages);
>>>   		if (page && set_migratetype_isolate(page, migratetype, flags,
>>>   					start_pfn, end_pfn)) {
>>>   			undo_isolate_page_range(start_pfn, pfn, migratetype);
>>> +			unset_migratetype_isolate(
>>> +				pfn_to_page(end_pfn - pageblock_nr_pages),
>>> +				migratetype);
>>>   			return -EBUSY;
>>>   		}
>>>   	}
>>> @@ -346,7 +520,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>    * Make isolated pages available again.
>>>    */
>>>   void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> -			    unsigned migratetype)
>>> +			    int migratetype)
>>>   {
>>>   	unsigned long pfn;
>>>   	struct page *page;
>>> -- 
>>> 2.35.1
>>
>> Qian hit a bug caused by this series https://lore.kernel.org/linux-mm/20220426201855.GA1014@qian/
>> and the fix is:
>>
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index 75e454f5cf45..b3f074d1682e 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -367,58 +367,67 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>   		}
>>   		/*
>>   		 * migrate compound pages then let the free page handling code
>> -		 * above do the rest. If migration is not enabled, just fail.
>> +		 * above do the rest. If migration is not possible, just fail.
>>   		 */
>> -		if (PageHuge(page) || PageTransCompound(page)) {
>> -#if defined CONFIG_COMPACTION || defined CONFIG_CMA
>> +		if (PageCompound(page)) {
>>   			unsigned long nr_pages = compound_nr(page);
>> -			int order = compound_order(page);
>>   			struct page *head = compound_head(page);
>>   			unsigned long head_pfn = page_to_pfn(head);
>> -			int ret;
>> -			struct compact_control cc = {
>> -				.nr_migratepages = 0,
>> -				.order = -1,
>> -				.zone = page_zone(pfn_to_page(head_pfn)),
>> -				.mode = MIGRATE_SYNC,
>> -				.ignore_skip_hint = true,
>> -				.no_set_skip_hint = true,
>> -				.gfp_mask = gfp_flags,
>> -				.alloc_contig = true,
>> -			};
>> -			INIT_LIST_HEAD(&cc.migratepages);
>>
>>   			if (head_pfn + nr_pages < boundary_pfn) {
>> -				pfn += nr_pages;
>> +				pfn = head_pfn + nr_pages;
>>   				continue;
>>   			}
>> -
>> -			ret = __alloc_contig_migrate_range(&cc, head_pfn,
>> -						head_pfn + nr_pages);
>> -
>> -			if (ret)
>> -				goto failed;
>> +#if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>   			/*
>> -			 * reset pfn, let the free page handling code above
>> -			 * split the free page to the right migratetype list.
>> -			 *
>> -			 * head_pfn is not used here as a hugetlb page order
>> -			 * can be bigger than MAX_ORDER-1, but after it is
>> -			 * freed, the free page order is not. Use pfn within
>> -			 * the range to find the head of the free page and
>> -			 * reset order to 0 if a hugetlb page with
>> -			 * >MAX_ORDER-1 order is encountered.
>> +			 * hugetlb, lru compound (THP), and movable compound pages
>> +			 * can be migrated. Otherwise, fail the isolation.
>>   			 */
>> -			if (order > MAX_ORDER-1)
>> +			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
>> +				int order;
>> +				unsigned long outer_pfn;
>> +				int ret;
>> +				struct compact_control cc = {
>> +					.nr_migratepages = 0,
>> +					.order = -1,
>> +					.zone = page_zone(pfn_to_page(head_pfn)),
>> +					.mode = MIGRATE_SYNC,
>> +					.ignore_skip_hint = true,
>> +					.no_set_skip_hint = true,
>> +					.gfp_mask = gfp_flags,
>> +					.alloc_contig = true,
>> +				};
>> +				INIT_LIST_HEAD(&cc.migratepages);
>> +
>> +				ret = __alloc_contig_migrate_range(&cc, head_pfn,
>> +							head_pfn + nr_pages);
>> +
>> +				if (ret)
>> +					goto failed;
>> +				/*
>> +				 * reset pfn to the head of the free page, so
>> +				 * that the free page handling code above can split
>> +				 * the free page to the right migratetype list.
>> +				 *
>> +				 * head_pfn is not used here as a hugetlb page order
>> +				 * can be bigger than MAX_ORDER-1, but after it is
>> +				 * freed, the free page order is not. Use pfn within
>> +				 * the range to find the head of the free page.
>> +				 */
>>   				order = 0;
>> -			while (!PageBuddy(pfn_to_page(pfn))) {
>> -				order++;
>> -				pfn &= ~0UL << order;
>> -			}
>> -			continue;
>> -#else
>> -			goto failed;
>> +				outer_pfn = pfn;
>> +				while (!PageBuddy(pfn_to_page(outer_pfn))) {
>> +					if (++order >= MAX_ORDER) {
>> +						outer_pfn = pfn;
>> +						break;
>> +					}
>> +					outer_pfn &= ~0UL << order;
>> +				}
>> +				pfn = outer_pfn;
>> +				continue;
>> +			} else
>>   #endif
>> +				goto failed;
>>   		}
>>
>>   		pfn++;

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-05-25 17:53       ` Zi Yan
@ 2022-05-25 21:03         ` Doug Berger
  2022-05-25 21:11           ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Doug Berger @ 2022-05-25 21:03 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton,
	kernel test robot, Qian Cai

On 5/25/2022 10:53 AM, Zi Yan wrote:
> On 25 May 2022, at 13:41, Doug Berger wrote:
> 
>> I am seeing some free memory accounting problems with linux-next that I have bisected to this commit (i.e. b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity").
>>
>> On an arm64 SMP platform with 4GB total memory and the default 16MB default CMA pool, I am seeing the following after boot with a sysrq Show Memory (e.g. 'echo m > /proc/sysrq-trigger'):
>>
>> [   16.015906] sysrq: Show Memory
>> [   16.019039] Mem-Info:
>> [   16.021348] active_anon:14604 inactive_anon:919 isolated_anon:0
>> [   16.021348]  active_file:0 inactive_file:0 isolated_file:0
>> [   16.021348]  unevictable:0 dirty:0 writeback:0
>> [   16.021348]  slab_reclaimable:3662 slab_unreclaimable:3333
>> [   16.021348]  mapped:928 shmem:15146 pagetables:63 bounce:0
>> [   16.021348]  kernel_misc_reclaimable:0
>> [   16.021348]  free:976766 free_pcp:991 free_cma:7017
>> [   16.056937] Node 0 active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3712kB dirty:0kB writeback:0kB shmem:60584kB writeback_tmp:0kB kernel_stack:1200kB pagetables:252kB all_unreclaimable? no
>> [   16.081526] DMA free:3041036kB boost:0kB min:6036kB low:9044kB high:12052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029992kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:28068kB
>> [   16.108650] lowmem_reserve[]: 0 0 944 944
>> [   16.112746] Normal free:866028kB boost:0kB min:1936kB low:2900kB high:3864kB reserved_highatomic:0KB active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3328kB local_pcp:864kB free_cma:0kB
>> [   16.140393] lowmem_reserve[]: 0 0 0 0
>> [   16.144133] DMA: 7*4kB (UMC) 4*8kB (M) 3*16kB (M) 3*32kB (MC) 5*64kB (M) 4*128kB (MC) 5*256kB (UMC) 7*512kB (UM) 5*1024kB (UM) 9*2048kB (UMC) 732*4096kB (MC) = 3027724kB
>> [   16.159609] Normal: 149*4kB (UM) 95*8kB (UME) 26*16kB (UME) 8*32kB (ME) 2*64kB (UE) 1*128kB (M) 2*256kB (ME) 2*512kB (ME) 2*1024kB (UM) 0*2048kB 210*4096kB (M) = 866028kB
>> [   16.175165] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>> [   16.183937] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
>> [   16.192533] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> [   16.201040] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
>> [   16.209374] 15146 total pagecache pages
>> [   16.213246] 0 pages in swap cache
>> [   16.216595] Swap cache stats: add 0, delete 0, find 0/0
>> [   16.221867] Free swap  = 0kB
>> [   16.224780] Total swap = 0kB
>> [   16.227693] 1048576 pages RAM
>> [   16.230694] 0 pages HighMem/MovableOnly
>> [   16.234564] 49240 pages reserved
>> [   16.237825] 4096 pages cma reserved
>>
>> Some anomalies in the above are:
>> free_cma:7017 with only 4096 pages cma reserved
>> DMA free:3041036kB with only managed:3029992kB
>>
>> I'm not sure what is going on here, but I am suspicious of split_free_page() since del_page_from_free_list doesn't affect migrate_type accounting, but __free_one_page() can.
>> Also PageBuddy(page) is being checked without zone->lock in isolate_single_pageblock().
>>
>> Please investigate this as well.
> 
> 
> Can you try this patch https://lore.kernel.org/linux-mm/20220524194756.1698351-1-zi.yan@sent.com/
> and see if it fixes the issue?
> 
> Thanks.
> 
The last hunk didn't apply directly to this commit, but I was able to 
apply the patch to linux-next/master with no improvement to the free 
memory accounting (actually anecdotally worse):

[    6.236828] sysrq: Show Memory
[    6.239973] Mem-Info:
[    6.242290] active_anon:14594 inactive_anon:924 isolated_anon:0
[    6.242290]  active_file:0 inactive_file:0 isolated_file:0
[    6.242290]  unevictable:0 dirty:0 writeback:0
[    6.242290]  slab_reclaimable:3671 slab_unreclaimable:3575
[    6.242290]  mapped:935 shmem:15147 pagetables:63 bounce:0
[    6.242290]  kernel_misc_reclaimable:0
[    6.242290]  free:1059009 free_pcp:1067 free_cma:90112
[    6.278048] Node 0 active_anon:58376kB inactive_anon:3844kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB mapped:3740kB dirty:0kB writeback:0kB shmem:60588kB 
writeback_tmp:0kB kernel_stack:1216kB pagetables:252kB all_unreclaimable? no
[    6.279422] arm-scmi brcm_scmi@0: timed out in resp(caller: 
scmi_perf_level_set+0xe0/0x110)
[    6.302501] DMA free:3372200kB boost:0kB min:6032kB low:9040kB 
high:12048kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB 
present:3145728kB managed:3029800kB mlocked:0kB bounce:0kB 
free_pcp:636kB local_pcp:0kB free_cma:360448kB
[    6.302515] lowmem_reserve[]: 0 0 944
[    6.310894] cpufreq: __target_index: Failed to change cpu frequency: -110
[    6.337920]  944
[    6.337925] Normal free:863584kB boost:0kB min:1940kB low:2904kB 
high:3868kB reserved_highatomic:0KB active_anon:58376kB 
inactive_anon:3896kB active_file:0kB inactive_file:0kB unevictable:0kB 
writepending:0kB present:1048576kB managed:967352kB mlocked:0kB 
bounce:0kB free_pcp:3492kB local_pcp:828kB free_cma:0kB
[    6.377782] lowmem_reserve[]: 0 0 0 0
[    6.381461] DMA: 4*4kB (UM) 5*8kB (M) 3*16kB (M) 2*32kB (M) 6*64kB 
(M) 5*128kB (M) 6*256kB (UM) 5*512kB (UM) 4*1024kB (M) 10*2048kB (UMC) 
732*4096kB (MC) = 3028136kB
[    6.396324] Normal: 84*4kB (U) 94*8kB (UM) 260*16kB (UME) 149*32kB 
(UM) 99*64kB (UME) 39*128kB (UM) 12*256kB (U) 3*512kB (UME) 2*1024kB 
(UM) 0*2048kB 204*4096kB (M) = 863584kB
[    6.412054] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=1048576kB
[    6.420770] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=32768kB
[    6.429312] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=2048kB
[    6.437767] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=64kB
[    6.446047] 15147 total pagecache pages
[    6.449890] 0 pages in swap cache
[    6.453210] Swap cache stats: add 0, delete 0, find 0/0
[    6.458445] Free swap  = 0kB
[    6.461331] Total swap = 0kB
[    6.464217] 1048576 pages RAM
[    6.467190] 0 pages HighMem/MovableOnly
[    6.471032] 49288 pages reserved
[    6.474267] 4096 pages cma reserved

Regards,
     Doug

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-05-25 21:03         ` Doug Berger
@ 2022-05-25 21:11           ` Zi Yan
  2022-05-26 17:34             ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-05-25 21:11 UTC (permalink / raw)
  To: Doug Berger
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton,
	kernel test robot, Qian Cai

[-- Attachment #1: Type: text/plain, Size: 6968 bytes --]

On 25 May 2022, at 17:03, Doug Berger wrote:

> On 5/25/2022 10:53 AM, Zi Yan wrote:
>> On 25 May 2022, at 13:41, Doug Berger wrote:
>>
>>> I am seeing some free memory accounting problems with linux-next that I have bisected to this commit (i.e. b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity").
>>>
>>> On an arm64 SMP platform with 4GB total memory and the default 16MB default CMA pool, I am seeing the following after boot with a sysrq Show Memory (e.g. 'echo m > /proc/sysrq-trigger'):
>>>
>>> [   16.015906] sysrq: Show Memory
>>> [   16.019039] Mem-Info:
>>> [   16.021348] active_anon:14604 inactive_anon:919 isolated_anon:0
>>> [   16.021348]  active_file:0 inactive_file:0 isolated_file:0
>>> [   16.021348]  unevictable:0 dirty:0 writeback:0
>>> [   16.021348]  slab_reclaimable:3662 slab_unreclaimable:3333
>>> [   16.021348]  mapped:928 shmem:15146 pagetables:63 bounce:0
>>> [   16.021348]  kernel_misc_reclaimable:0
>>> [   16.021348]  free:976766 free_pcp:991 free_cma:7017
>>> [   16.056937] Node 0 active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3712kB dirty:0kB writeback:0kB shmem:60584kB writeback_tmp:0kB kernel_stack:1200kB pagetables:252kB all_unreclaimable? no
>>> [   16.081526] DMA free:3041036kB boost:0kB min:6036kB low:9044kB high:12052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029992kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:28068kB
>>> [   16.108650] lowmem_reserve[]: 0 0 944 944
>>> [   16.112746] Normal free:866028kB boost:0kB min:1936kB low:2900kB high:3864kB reserved_highatomic:0KB active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3328kB local_pcp:864kB free_cma:0kB
>>> [   16.140393] lowmem_reserve[]: 0 0 0 0
>>> [   16.144133] DMA: 7*4kB (UMC) 4*8kB (M) 3*16kB (M) 3*32kB (MC) 5*64kB (M) 4*128kB (MC) 5*256kB (UMC) 7*512kB (UM) 5*1024kB (UM) 9*2048kB (UMC) 732*4096kB (MC) = 3027724kB
>>> [   16.159609] Normal: 149*4kB (UM) 95*8kB (UME) 26*16kB (UME) 8*32kB (ME) 2*64kB (UE) 1*128kB (M) 2*256kB (ME) 2*512kB (ME) 2*1024kB (UM) 0*2048kB 210*4096kB (M) = 866028kB
>>> [   16.175165] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>> [   16.183937] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
>>> [   16.192533] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>> [   16.201040] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
>>> [   16.209374] 15146 total pagecache pages
>>> [   16.213246] 0 pages in swap cache
>>> [   16.216595] Swap cache stats: add 0, delete 0, find 0/0
>>> [   16.221867] Free swap  = 0kB
>>> [   16.224780] Total swap = 0kB
>>> [   16.227693] 1048576 pages RAM
>>> [   16.230694] 0 pages HighMem/MovableOnly
>>> [   16.234564] 49240 pages reserved
>>> [   16.237825] 4096 pages cma reserved
>>>
>>> Some anomalies in the above are:
>>> free_cma:7017 with only 4096 pages cma reserved
>>> DMA free:3041036kB with only managed:3029992kB
>>>
>>> I'm not sure what is going on here, but I am suspicious of split_free_page() since del_page_from_free_list doesn't affect migrate_type accounting, but __free_one_page() can.
>>> Also PageBuddy(page) is being checked without zone->lock in isolate_single_pageblock().
>>>
>>> Please investigate this as well.
>>
>>
>> Can you try this patch https://lore.kernel.org/linux-mm/20220524194756.1698351-1-zi.yan@sent.com/
>> and see if it fixes the issue?
>>
>> Thanks.
>>
> The last hunk didn't apply directly to this commit, but I was able to apply the patch to linux-next/master with no improvement to the free memory accounting (actually anecdotally worse):
>
> [    6.236828] sysrq: Show Memory
> [    6.239973] Mem-Info:
> [    6.242290] active_anon:14594 inactive_anon:924 isolated_anon:0
> [    6.242290]  active_file:0 inactive_file:0 isolated_file:0
> [    6.242290]  unevictable:0 dirty:0 writeback:0
> [    6.242290]  slab_reclaimable:3671 slab_unreclaimable:3575
> [    6.242290]  mapped:935 shmem:15147 pagetables:63 bounce:0
> [    6.242290]  kernel_misc_reclaimable:0
> [    6.242290]  free:1059009 free_pcp:1067 free_cma:90112
> [    6.278048] Node 0 active_anon:58376kB inactive_anon:3844kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3740kB dirty:0kB writeback:0kB shmem:60588kB writeback_tmp:0kB kernel_stack:1216kB pagetables:252kB all_unreclaimable? no
> [    6.279422] arm-scmi brcm_scmi@0: timed out in resp(caller: scmi_perf_level_set+0xe0/0x110)
> [    6.302501] DMA free:3372200kB boost:0kB min:6032kB low:9040kB high:12048kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029800kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:360448kB
> [    6.302515] lowmem_reserve[]: 0 0 944
> [    6.310894] cpufreq: __target_index: Failed to change cpu frequency: -110
> [    6.337920]  944
> [    6.337925] Normal free:863584kB boost:0kB min:1940kB low:2904kB high:3868kB reserved_highatomic:0KB active_anon:58376kB inactive_anon:3896kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3492kB local_pcp:828kB free_cma:0kB
> [    6.377782] lowmem_reserve[]: 0 0 0 0
> [    6.381461] DMA: 4*4kB (UM) 5*8kB (M) 3*16kB (M) 2*32kB (M) 6*64kB (M) 5*128kB (M) 6*256kB (UM) 5*512kB (UM) 4*1024kB (M) 10*2048kB (UMC) 732*4096kB (MC) = 3028136kB
> [    6.396324] Normal: 84*4kB (U) 94*8kB (UM) 260*16kB (UME) 149*32kB (UM) 99*64kB (UME) 39*128kB (UM) 12*256kB (U) 3*512kB (UME) 2*1024kB (UM) 0*2048kB 204*4096kB (M) = 863584kB
> [    6.412054] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [    6.420770] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
> [    6.429312] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [    6.437767] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
> [    6.446047] 15147 total pagecache pages
> [    6.449890] 0 pages in swap cache
> [    6.453210] Swap cache stats: add 0, delete 0, find 0/0
> [    6.458445] Free swap  = 0kB
> [    6.461331] Total swap = 0kB
> [    6.464217] 1048576 pages RAM
> [    6.467190] 0 pages HighMem/MovableOnly
> [    6.471032] 49288 pages reserved
> [    6.474267] 4096 pages cma reserved
>
> Regards,
>     Doug

I will look into it. Thanks for reporting it.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-05-25 21:11           ` Zi Yan
@ 2022-05-26 17:34             ` Zi Yan
  2022-05-26 19:46               ` Doug Berger
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2022-05-26 17:34 UTC (permalink / raw)
  To: Doug Berger
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton,
	kernel test robot, Qian Cai

[-- Attachment #1: Type: text/plain, Size: 12972 bytes --]

On 25 May 2022, at 17:11, Zi Yan wrote:

> On 25 May 2022, at 17:03, Doug Berger wrote:
>
>> On 5/25/2022 10:53 AM, Zi Yan wrote:
>>> On 25 May 2022, at 13:41, Doug Berger wrote:
>>>
>>>> I am seeing some free memory accounting problems with linux-next that I have bisected to this commit (i.e. b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity").
>>>>
>>>> On an arm64 SMP platform with 4GB total memory and the default 16MB default CMA pool, I am seeing the following after boot with a sysrq Show Memory (e.g. 'echo m > /proc/sysrq-trigger'):
>>>>
>>>> [   16.015906] sysrq: Show Memory
>>>> [   16.019039] Mem-Info:
>>>> [   16.021348] active_anon:14604 inactive_anon:919 isolated_anon:0
>>>> [   16.021348]  active_file:0 inactive_file:0 isolated_file:0
>>>> [   16.021348]  unevictable:0 dirty:0 writeback:0
>>>> [   16.021348]  slab_reclaimable:3662 slab_unreclaimable:3333
>>>> [   16.021348]  mapped:928 shmem:15146 pagetables:63 bounce:0
>>>> [   16.021348]  kernel_misc_reclaimable:0
>>>> [   16.021348]  free:976766 free_pcp:991 free_cma:7017
>>>> [   16.056937] Node 0 active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3712kB dirty:0kB writeback:0kB shmem:60584kB writeback_tmp:0kB kernel_stack:1200kB pagetables:252kB all_unreclaimable? no
>>>> [   16.081526] DMA free:3041036kB boost:0kB min:6036kB low:9044kB high:12052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029992kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:28068kB
>>>> [   16.108650] lowmem_reserve[]: 0 0 944 944
>>>> [   16.112746] Normal free:866028kB boost:0kB min:1936kB low:2900kB high:3864kB reserved_highatomic:0KB active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3328kB local_pcp:864kB free_cma:0kB
>>>> [   16.140393] lowmem_reserve[]: 0 0 0 0
>>>> [   16.144133] DMA: 7*4kB (UMC) 4*8kB (M) 3*16kB (M) 3*32kB (MC) 5*64kB (M) 4*128kB (MC) 5*256kB (UMC) 7*512kB (UM) 5*1024kB (UM) 9*2048kB (UMC) 732*4096kB (MC) = 3027724kB
>>>> [   16.159609] Normal: 149*4kB (UM) 95*8kB (UME) 26*16kB (UME) 8*32kB (ME) 2*64kB (UE) 1*128kB (M) 2*256kB (ME) 2*512kB (ME) 2*1024kB (UM) 0*2048kB 210*4096kB (M) = 866028kB
>>>> [   16.175165] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>>> [   16.183937] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
>>>> [   16.192533] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>>> [   16.201040] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
>>>> [   16.209374] 15146 total pagecache pages
>>>> [   16.213246] 0 pages in swap cache
>>>> [   16.216595] Swap cache stats: add 0, delete 0, find 0/0
>>>> [   16.221867] Free swap  = 0kB
>>>> [   16.224780] Total swap = 0kB
>>>> [   16.227693] 1048576 pages RAM
>>>> [   16.230694] 0 pages HighMem/MovableOnly
>>>> [   16.234564] 49240 pages reserved
>>>> [   16.237825] 4096 pages cma reserved
>>>>
>>>> Some anomalies in the above are:
>>>> free_cma:7017 with only 4096 pages cma reserved
>>>> DMA free:3041036kB with only managed:3029992kB
>>>>
>>>> I'm not sure what is going on here, but I am suspicious of split_free_page() since del_page_from_free_list doesn't affect migrate_type accounting, but __free_one_page() can.
>>>> Also PageBuddy(page) is being checked without zone->lock in isolate_single_pageblock().
>>>>
>>>> Please investigate this as well.
>>>
>>>
>>> Can you try this patch https://lore.kernel.org/linux-mm/20220524194756.1698351-1-zi.yan@sent.com/
>>> and see if it fixes the issue?
>>>
>>> Thanks.
>>>
>> The last hunk didn't apply directly to this commit, but I was able to apply the patch to linux-next/master with no improvement to the free memory accounting (actually anecdotally worse):
>>
>> [    6.236828] sysrq: Show Memory
>> [    6.239973] Mem-Info:
>> [    6.242290] active_anon:14594 inactive_anon:924 isolated_anon:0
>> [    6.242290]  active_file:0 inactive_file:0 isolated_file:0
>> [    6.242290]  unevictable:0 dirty:0 writeback:0
>> [    6.242290]  slab_reclaimable:3671 slab_unreclaimable:3575
>> [    6.242290]  mapped:935 shmem:15147 pagetables:63 bounce:0
>> [    6.242290]  kernel_misc_reclaimable:0
>> [    6.242290]  free:1059009 free_pcp:1067 free_cma:90112
>> [    6.278048] Node 0 active_anon:58376kB inactive_anon:3844kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3740kB dirty:0kB writeback:0kB shmem:60588kB writeback_tmp:0kB kernel_stack:1216kB pagetables:252kB all_unreclaimable? no
>> [    6.279422] arm-scmi brcm_scmi@0: timed out in resp(caller: scmi_perf_level_set+0xe0/0x110)
>> [    6.302501] DMA free:3372200kB boost:0kB min:6032kB low:9040kB high:12048kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029800kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:360448kB
>> [    6.302515] lowmem_reserve[]: 0 0 944
>> [    6.310894] cpufreq: __target_index: Failed to change cpu frequency: -110
>> [    6.337920]  944
>> [    6.337925] Normal free:863584kB boost:0kB min:1940kB low:2904kB high:3868kB reserved_highatomic:0KB active_anon:58376kB inactive_anon:3896kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3492kB local_pcp:828kB free_cma:0kB
>> [    6.377782] lowmem_reserve[]: 0 0 0 0
>> [    6.381461] DMA: 4*4kB (UM) 5*8kB (M) 3*16kB (M) 2*32kB (M) 6*64kB (M) 5*128kB (M) 6*256kB (UM) 5*512kB (UM) 4*1024kB (M) 10*2048kB (UMC) 732*4096kB (MC) = 3028136kB
>> [    6.396324] Normal: 84*4kB (U) 94*8kB (UM) 260*16kB (UME) 149*32kB (UM) 99*64kB (UME) 39*128kB (UM) 12*256kB (U) 3*512kB (UME) 2*1024kB (UM) 0*2048kB 204*4096kB (M) = 863584kB
>> [    6.412054] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>> [    6.420770] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
>> [    6.429312] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> [    6.437767] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
>> [    6.446047] 15147 total pagecache pages
>> [    6.449890] 0 pages in swap cache
>> [    6.453210] Swap cache stats: add 0, delete 0, find 0/0
>> [    6.458445] Free swap  = 0kB
>> [    6.461331] Total swap = 0kB
>> [    6.464217] 1048576 pages RAM
>> [    6.467190] 0 pages HighMem/MovableOnly
>> [    6.471032] 49288 pages reserved
>> [    6.474267] 4096 pages cma reserved
>>
>> Regards,
>>     Doug
>
> I will look into it. Thanks for reporting it.

Hi Doug,

Can you try the patch below? It takes out free pages under zone lock now
and modifies page stats properly. Thanks.
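
The accounting imbalance being fixed, in short: __free_one_page() bumps the
zone freepage counters (NR_FREE_PAGES, plus NR_FREE_CMA_PAGES for CMA
pageblocks) for whatever it puts back on the free list, while
del_page_from_free_list() only unlinks the page without touching those
counters, so splitting a free page by deleting it and re-freeing the pieces
adds the page's size to the counters a second time. A minimal sketch of the
balanced sequence, using the helpers and local variables named in the patch
below (simplified illustration, not the patch itself):

	spin_lock_irqsave(&zone->lock, flags);

	/* the free page may have been allocated or merged since last seen */
	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
		ret = -ENOENT;
		goto out;
	}

	/*
	 * __free_one_page() will re-add each split piece to the freepage
	 * counters, so subtract the whole page's contribution first
	 * (isolated pageblocks are not accounted there).
	 */
	mt = get_pageblock_migratetype(free_page);
	if (likely(!is_migrate_isolate(mt)))
		__mod_zone_freepage_state(zone, -(1UL << order), mt);

	del_page_from_free_list(free_page, zone, order);
	/* ... then __free_one_page() each piece at its own order/migratetype ... */
out:
	spin_unlock_irqrestore(&zone->lock, flags);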


diff --git a/mm/internal.h b/mm/internal.h
index 64e61b032dac..c0f8fbe0445b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -374,8 +374,8 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
 			  phys_addr_t min_addr,
 			  int nid, bool exact_nid);

-void split_free_page(struct page *free_page,
-				int order, unsigned long split_pfn_offset);
+int split_free_page(struct page *free_page,
+			unsigned int order, unsigned long split_pfn_offset);

 #if defined CONFIG_COMPACTION || defined CONFIG_CMA

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc93a82e51e6..6f6e4649ac21 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1100,30 +1100,44 @@ static inline void __free_one_page(struct page *page,
  * @order:		the order of the page
  * @split_pfn_offset:	split offset within the page
  *
+ * Return -ENOENT if the free page is changed, otherwise 0
+ *
  * It is used when the free page crosses two pageblocks with different migratetypes
  * at split_pfn_offset within the page. The split free page will be put into
  * separate migratetype lists afterwards. Otherwise, the function achieves
  * nothing.
  */
-void split_free_page(struct page *free_page,
-				int order, unsigned long split_pfn_offset)
+int split_free_page(struct page *free_page,
+			unsigned int order, unsigned long split_pfn_offset)
 {
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
 	unsigned long flags;
 	int free_page_order;
+	int mt;
+	int ret = 0;

 	if (split_pfn_offset == 0)
-		return;
+		return ret;

 	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	mt = get_pageblock_migratetype(free_page);
+	if (likely(!is_migrate_isolate(mt)))
+		__mod_zone_freepage_state(zone, -(1UL << order), mt);
+
 	del_page_from_free_list(free_page, zone, order);
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

-		free_page_order = min_t(int,
+		free_page_order = min_t(unsigned int,
 					pfn ? __ffs(pfn) : order,
 					__fls(split_pfn_offset));
 		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
@@ -1134,7 +1148,9 @@ void split_free_page(struct page *free_page,
 		if (split_pfn_offset == 0)
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
+out:
 	spin_unlock_irqrestore(&zone->lock, flags);
+	return ret;
 }
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c643c8420809..f539ccf7fb44 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -300,7 +300,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * the in-use page then splitting the free page.
  */
 static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
-			gfp_t gfp_flags, bool isolate_before)
+			gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
 {
 	unsigned char saved_mt;
 	unsigned long start_pfn;
@@ -327,11 +327,16 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				      zone->zone_start_pfn);

 	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-	ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
-			isolate_pageblock, isolate_pageblock + pageblock_nr_pages);

-	if (ret)
-		return ret;
+	if (skip_isolation)
+		VM_BUG_ON(!is_migrate_isolate(saved_mt));
+	else {
+		ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+				isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
+
+		if (ret)
+			return ret;
+	}

 	/*
 	 * Bail out early when the to-be-isolated pageblock does not form
@@ -367,8 +372,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			int order = buddy_order(page);

 			if (pfn + (1UL << order) > boundary_pfn)
-				split_free_page(page, order, boundary_pfn - pfn);
-			pfn += (1UL << order);
+				/* free page changed before split, check it again */
+				if (split_free_page(page, order, boundary_pfn - pfn))
+				    continue;
+
+			pfn += 1UL << order;
 			continue;
 		}
 		/*
@@ -463,7 +471,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 	return 0;
 failed:
 	/* restore the original migratetype */
-	unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
+	if (!skip_isolation)
+		unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
 	return -EBUSY;
 }

@@ -522,14 +531,18 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
 	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
 	int ret;
+	bool skip_isolation = false;

 	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
+	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
 	if (ret)
 		return ret;

+	if (isolate_start == isolate_end - pageblock_nr_pages)
+		skip_isolation = true;
+
 	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
 	if (ret) {
 		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
 		return ret;


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity
  2022-05-26 17:34             ` Zi Yan
@ 2022-05-26 19:46               ` Doug Berger
  0 siblings, 0 replies; 44+ messages in thread
From: Doug Berger @ 2022-05-26 19:46 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, virtualization,
	Vlastimil Babka, Mel Gorman, Eric Ren, Mike Rapoport,
	Oscar Salvador, Christophe Leroy, Andrew Morton,
	kernel test robot, Qian Cai

On 5/26/2022 10:34 AM, Zi Yan wrote:
> On 25 May 2022, at 17:11, Zi Yan wrote:
> 
>> On 25 May 2022, at 17:03, Doug Berger wrote:
>>
>>> On 5/25/2022 10:53 AM, Zi Yan wrote:
>>>> On 25 May 2022, at 13:41, Doug Berger wrote:
>>>>
>>>>> I am seeing some free memory accounting problems with linux-next that I have bisected to this commit (i.e. b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity").
>>>>>
>>>>> On an arm64 SMP platform with 4GB total memory and the default 16MB default CMA pool, I am seeing the following after boot with a sysrq Show Memory (e.g. 'echo m > /proc/sysrq-trigger'):
>>>>>
>>>>> [   16.015906] sysrq: Show Memory
>>>>> [   16.019039] Mem-Info:
>>>>> [   16.021348] active_anon:14604 inactive_anon:919 isolated_anon:0
>>>>> [   16.021348]  active_file:0 inactive_file:0 isolated_file:0
>>>>> [   16.021348]  unevictable:0 dirty:0 writeback:0
>>>>> [   16.021348]  slab_reclaimable:3662 slab_unreclaimable:3333
>>>>> [   16.021348]  mapped:928 shmem:15146 pagetables:63 bounce:0
>>>>> [   16.021348]  kernel_misc_reclaimable:0
>>>>> [   16.021348]  free:976766 free_pcp:991 free_cma:7017
>>>>> [   16.056937] Node 0 active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3712kB dirty:0kB writeback:0kB shmem:60584kB writeback_tmp:0kB kernel_stack:1200kB pagetables:252kB all_unreclaimable? no
>>>>> [   16.081526] DMA free:3041036kB boost:0kB min:6036kB low:9044kB high:12052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029992kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:28068kB
>>>>> [   16.108650] lowmem_reserve[]: 0 0 944 944
>>>>> [   16.112746] Normal free:866028kB boost:0kB min:1936kB low:2900kB high:3864kB reserved_highatomic:0KB active_anon:58416kB inactive_anon:3676kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3328kB local_pcp:864kB free_cma:0kB
>>>>> [   16.140393] lowmem_reserve[]: 0 0 0 0
>>>>> [   16.144133] DMA: 7*4kB (UMC) 4*8kB (M) 3*16kB (M) 3*32kB (MC) 5*64kB (M) 4*128kB (MC) 5*256kB (UMC) 7*512kB (UM) 5*1024kB (UM) 9*2048kB (UMC) 732*4096kB (MC) = 3027724kB
>>>>> [   16.159609] Normal: 149*4kB (UM) 95*8kB (UME) 26*16kB (UME) 8*32kB (ME) 2*64kB (UE) 1*128kB (M) 2*256kB (ME) 2*512kB (ME) 2*1024kB (UM) 0*2048kB 210*4096kB (M) = 866028kB
>>>>> [   16.175165] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>>>> [   16.183937] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
>>>>> [   16.192533] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>>>> [   16.201040] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
>>>>> [   16.209374] 15146 total pagecache pages
>>>>> [   16.213246] 0 pages in swap cache
>>>>> [   16.216595] Swap cache stats: add 0, delete 0, find 0/0
>>>>> [   16.221867] Free swap  = 0kB
>>>>> [   16.224780] Total swap = 0kB
>>>>> [   16.227693] 1048576 pages RAM
>>>>> [   16.230694] 0 pages HighMem/MovableOnly
>>>>> [   16.234564] 49240 pages reserved
>>>>> [   16.237825] 4096 pages cma reserved
>>>>>
>>>>> Some anomalies in the above are:
>>>>> free_cma:7017 with only 4096 pages cma reserved
>>>>> DMA free:3041036kB with only managed:3029992kB
>>>>>
>>>>> I'm not sure what is going on here, but I am suspicious of split_free_page() since del_page_from_free_list doesn't affect migrate_type accounting, but __free_one_page() can.
>>>>> Also PageBuddy(page) is being checked without zone->lock in isolate_single_pageblock().
>>>>>
>>>>> Please investigate this as well.
>>>>
>>>>
>>>> Can you try this patch https://lore.kernel.org/linux-mm/20220524194756.1698351-1-zi.yan@sent.com/
>>>> and see if it fixes the issue?
>>>>
>>>> Thanks.
>>>>
>>> The last hunk didn't apply directly to this commit, but I was able to apply the patch to linux-next/master with no improvement to the free memory accounting (actually anecdotally worse):
>>>
>>> [    6.236828] sysrq: Show Memory
>>> [    6.239973] Mem-Info:
>>> [    6.242290] active_anon:14594 inactive_anon:924 isolated_anon:0
>>> [    6.242290]  active_file:0 inactive_file:0 isolated_file:0
>>> [    6.242290]  unevictable:0 dirty:0 writeback:0
>>> [    6.242290]  slab_reclaimable:3671 slab_unreclaimable:3575
>>> [    6.242290]  mapped:935 shmem:15147 pagetables:63 bounce:0
>>> [    6.242290]  kernel_misc_reclaimable:0
>>> [    6.242290]  free:1059009 free_pcp:1067 free_cma:90112
>>> [    6.278048] Node 0 active_anon:58376kB inactive_anon:3844kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3740kB dirty:0kB writeback:0kB shmem:60588kB writeback_tmp:0kB kernel_stack:1216kB pagetables:252kB all_unreclaimable? no
>>> [    6.279422] arm-scmi brcm_scmi@0: timed out in resp(caller: scmi_perf_level_set+0xe0/0x110)
>>> [    6.302501] DMA free:3372200kB boost:0kB min:6032kB low:9040kB high:12048kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3145728kB managed:3029800kB mlocked:0kB bounce:0kB free_pcp:636kB local_pcp:0kB free_cma:360448kB
>>> [    6.302515] lowmem_reserve[]: 0 0 944
>>> [    6.310894] cpufreq: __target_index: Failed to change cpu frequency: -110
>>> [    6.337920]  944
>>> [    6.337925] Normal free:863584kB boost:0kB min:1940kB low:2904kB high:3868kB reserved_highatomic:0KB active_anon:58376kB inactive_anon:3896kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:967352kB mlocked:0kB bounce:0kB free_pcp:3492kB local_pcp:828kB free_cma:0kB
>>> [    6.377782] lowmem_reserve[]: 0 0 0 0
>>> [    6.381461] DMA: 4*4kB (UM) 5*8kB (M) 3*16kB (M) 2*32kB (M) 6*64kB (M) 5*128kB (M) 6*256kB (UM) 5*512kB (UM) 4*1024kB (M) 10*2048kB (UMC) 732*4096kB (MC) = 3028136kB
>>> [    6.396324] Normal: 84*4kB (U) 94*8kB (UM) 260*16kB (UME) 149*32kB (UM) 99*64kB (UME) 39*128kB (UM) 12*256kB (U) 3*512kB (UME) 2*1024kB (UM) 0*2048kB 204*4096kB (M) = 863584kB
>>> [    6.412054] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>> [    6.420770] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
>>> [    6.429312] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>> [    6.437767] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
>>> [    6.446047] 15147 total pagecache pages
>>> [    6.449890] 0 pages in swap cache
>>> [    6.453210] Swap cache stats: add 0, delete 0, find 0/0
>>> [    6.458445] Free swap  = 0kB
>>> [    6.461331] Total swap = 0kB
>>> [    6.464217] 1048576 pages RAM
>>> [    6.467190] 0 pages HighMem/MovableOnly
>>> [    6.471032] 49288 pages reserved
>>> [    6.474267] 4096 pages cma reserved
>>>
>>> Regards,
>>>      Doug
>>
>> I will look into it. Thanks for reporting it.
> 
> Hi Doug,
> 
> Can you try the patch below? It takes out free pages under zone lock now
> and modifies page stats properly. Thanks.
> 
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 64e61b032dac..c0f8fbe0445b 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -374,8 +374,8 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>   			  phys_addr_t min_addr,
>   			  int nid, bool exact_nid);
> 
> -void split_free_page(struct page *free_page,
> -				int order, unsigned long split_pfn_offset);
> +int split_free_page(struct page *free_page,
> +			unsigned int order, unsigned long split_pfn_offset);
> 
>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bc93a82e51e6..6f6e4649ac21 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1100,30 +1100,44 @@ static inline void __free_one_page(struct page *page,
>    * @order:		the order of the page
>    * @split_pfn_offset:	split offset within the page
>    *
> + * Return -ENOENT if the free page is changed, otherwise 0
> + *
>    * It is used when the free page crosses two pageblocks with different migratetypes
>    * at split_pfn_offset within the page. The split free page will be put into
>    * separate migratetype lists afterwards. Otherwise, the function achieves
>    * nothing.
>    */
> -void split_free_page(struct page *free_page,
> -				int order, unsigned long split_pfn_offset)
> +int split_free_page(struct page *free_page,
> +			unsigned int order, unsigned long split_pfn_offset)
>   {
>   	struct zone *zone = page_zone(free_page);
>   	unsigned long free_page_pfn = page_to_pfn(free_page);
>   	unsigned long pfn;
>   	unsigned long flags;
>   	int free_page_order;
> +	int mt;
> +	int ret = 0;
> 
>   	if (split_pfn_offset == 0)
> -		return;
> +		return ret;
> 
>   	spin_lock_irqsave(&zone->lock, flags);
> +
> +	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	mt = get_pageblock_migratetype(free_page);
> +	if (likely(!is_migrate_isolate(mt)))
> +		__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +
>   	del_page_from_free_list(free_page, zone, order);
>   	for (pfn = free_page_pfn;
>   	     pfn < free_page_pfn + (1UL << order);) {
>   		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> 
> -		free_page_order = min_t(int,
> +		free_page_order = min_t(unsigned int,
>   					pfn ? __ffs(pfn) : order,
>   					__fls(split_pfn_offset));
This part of the patch doesn't agree with any version of page_alloc.c I 
have, but I was able to manually apply the change.


>   		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
> @@ -1134,7 +1148,9 @@ void split_free_page(struct page *free_page,
>   		if (split_pfn_offset == 0)
>   			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>   	}
> +out:
>   	spin_unlock_irqrestore(&zone->lock, flags);
> +	return ret;
>   }
>   /*
>    * A bad page could be due to a number of fields. Instead of multiple branches,
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c643c8420809..f539ccf7fb44 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -300,7 +300,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>    * the in-use page then splitting the free page.
>    */
>   static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> -			gfp_t gfp_flags, bool isolate_before)
> +			gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>   {
>   	unsigned char saved_mt;
>   	unsigned long start_pfn;
> @@ -327,11 +327,16 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>   				      zone->zone_start_pfn);
> 
>   	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
> -	ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
> -			isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
> 
> -	if (ret)
> -		return ret;
> +	if (skip_isolation)
> +		VM_BUG_ON(!is_migrate_isolate(saved_mt));
> +	else {
> +		ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
> +				isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
> +
> +		if (ret)
> +			return ret;
> +	}
> 
>   	/*
>   	 * Bail out early when the to-be-isolated pageblock does not form
> @@ -367,8 +372,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>   			int order = buddy_order(page);
> 
>   			if (pfn + (1UL << order) > boundary_pfn)
> -				split_free_page(page, order, boundary_pfn - pfn);
> -			pfn += (1UL << order);
> +				/* free page changed before split, check it again */
> +				if (split_free_page(page, order, boundary_pfn - pfn))
> +				    continue;
> +
> +			pfn += 1UL << order;
>   			continue;
>   		}
>   		/*
> @@ -463,7 +471,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>   	return 0;
>   failed:
>   	/* restore the original migratetype */
> -	unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
> +	if (!skip_isolation)
> +		unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>   	return -EBUSY;
>   }
> 
> @@ -522,14 +531,18 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>   	unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>   	unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>   	int ret;
> +	bool skip_isolation = false;
> 
>   	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
> -	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
> +	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>   	if (ret)
>   		return ret;
> 
> +	if (isolate_start == isolate_end - pageblock_nr_pages)
> +		skip_isolation = true;
> +
>   	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> -	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
> +	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>   	if (ret) {
>   		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>   		return ret;
> 
> 
> --
> Best Regards,
> Yan, Zi

This patch does appear to fix the problem I observed. I'll poke it a 
little more, but so far it looks good.

Thanks!
     Doug

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-05-26 19:46 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-25 14:31 [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Zi Yan
2022-04-25 14:31 ` [PATCH v11 1/6] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c Zi Yan
2022-04-25 14:31 ` [PATCH v11 2/6] mm: page_isolation: check specified range for unmovable pages Zi Yan
2022-04-25 14:31 ` [PATCH v11 3/6] mm: make alloc_contig_range work at pageblock granularity Zi Yan
2022-04-29 13:54   ` Zi Yan
2022-05-24 19:00     ` Zi Yan
2022-05-25 17:41     ` Doug Berger
2022-05-25 17:53       ` Zi Yan
2022-05-25 21:03         ` Doug Berger
2022-05-25 21:11           ` Zi Yan
2022-05-26 17:34             ` Zi Yan
2022-05-26 19:46               ` Doug Berger
2022-04-25 14:31 ` [PATCH v11 4/6] mm: page_isolation: enable arbitrary range page isolation Zi Yan
2022-05-24 19:02   ` Zi Yan
2022-04-25 14:31 ` [PATCH v11 5/6] mm: cma: use pageblock_order as the single alignment Zi Yan
2022-04-25 14:31 ` [PATCH v11 6/6] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size Zi Yan
2022-04-26 20:18 ` [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment Qian Cai
2022-04-26 20:26   ` Zi Yan
2022-04-26 21:08     ` Qian Cai
2022-04-26 21:38       ` Zi Yan
2022-04-27 12:41         ` Qian Cai
2022-04-27 13:10         ` Qian Cai
2022-04-27 13:27         ` Qian Cai
2022-04-27 13:30           ` Zi Yan
2022-04-27 21:04             ` Zi Yan
2022-04-28 12:33               ` Qian Cai
2022-04-28 12:39                 ` Zi Yan
2022-04-28 16:19                   ` Qian Cai
2022-04-29 13:38                     ` Zi Yan
2022-05-19 20:57                   ` Qian Cai
2022-05-19 21:35                     ` Zi Yan
2022-05-19 23:24                       ` Zi Yan
2022-05-20 11:30                       ` Qian Cai
2022-05-20 13:43                         ` Zi Yan
2022-05-20 14:13                           ` Zi Yan
2022-05-20 19:41                             ` Qian Cai
2022-05-20 21:56                               ` Zi Yan
2022-05-20 23:41                                 ` Qian Cai
2022-05-22 16:54                                   ` Zi Yan
2022-05-22 19:33                                     ` Zi Yan
2022-05-24 16:59                                     ` Qian Cai
2022-05-10  1:03 ` Andrew Morton
2022-05-10  1:03   ` Andrew Morton
2022-05-10  1:07   ` Zi Yan
