linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Enable >0 order folio memory compaction
@ 2023-09-12 16:28 Zi Yan
  2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-12 16:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Zi Yan, Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset enables >0 order folio memory compaction, which is one of
the prerequisites for large folio support[1]. It is on top of
mm-everything-2023-09-11-22-56.

Overview
===

To support >0 order folio compaction, the patchset changes how free pages used
for migration are kept during compaction. Free pages used to be split into
order-0 pages and post-allocation processed (i.e., the PageBuddy flag is
cleared, the page order stored in page->private is zeroed, and the page
reference count is set to 1). Now all free pages are kept in a MAX_ORDER+1
array of page lists based on their order, without any post-allocation
processing. When migrate_pages() asks for a new page, a free page of the
requested order is taken from the corresponding list, processed, and handed
out.
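
As a rough sketch of the new bookkeeping (simplified from patch 1; locking,
the isolation path, and error handling are omitted, and the helper name is
made up for illustration):

struct free_list {
	struct list_head	pages;		/* isolated free pages of one order */
	unsigned long		nr_free;	/* number of entries on the list */
};

/* struct compact_control carries: struct free_list freepages[MAX_ORDER + 1]; */

/* roughly what compaction_alloc() does for a source folio of 'order' */
static struct folio *take_free_page(struct compact_control *cc, int order)
{
	struct folio *dst;

	if (!cc->freepages[order].nr_free)
		return NULL;	/* caller isolates more pages (or splits, see patch 2) */

	dst = list_first_entry(&cc->freepages[order].pages, struct folio, lru);
	list_del(&dst->lru);
	cc->freepages[order].nr_free--;

	/* post-allocation processing is deferred to this point */
	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
	if (order)
		prep_compound_page(&dst->page, order);
	cc->nr_freepages -= 1 << order;

	return dst;
}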


Optimizations
===

1. Free page split is added to increase the migration success rate in case
a source page does not have a matching free page in the free page lists (a
simplified sketch of the split loop follows this list). Free page merge is
possible but not implemented, since the existing PFN-based buddy page merge
algorithm requires identifying buddy pages, and free pages kept for memory
compaction cannot have PageBuddy set without confusing other PFN scanners.

2. Source pages are sorted by order, highest order first, before migration
to reduce free page splits. Otherwise, high order free pages might be
prematurely split, causing undesired high order folio migration failures.
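
The split in optimization 1 above is the usual buddy-style halving; the loop
below is a slightly simplified excerpt from patch 2, where freepage is the
first page of an isolated free page of order start_order and size starts at
1 << start_order:

	while (start_order > order) {
		start_order--;
		size >>= 1;

		/* return the upper half to the matching per-order free list */
		list_add(&freepage[size].lru,
			 &cc->freepages[start_order].pages);
		cc->freepages[start_order].nr_free++;
		set_page_private(&freepage[size], start_order);
	}
	/* freepage is now an order-'order' chunk, ready for post_alloc_hook() */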


TODOs
===

1. Refactor the free page post-allocation and free page preparation code so
that compaction_alloc() and compaction_free() can call helper functions
instead of open-coding the operations.

2. One possible optimization is to allow migrate_pages() to continue even
if get_new_folio() returns NULL. In general, a NULL return means there is
not enough memory. But in the >0 order folio compaction case, it means
there is no suitable free page at the source page order. It might be better
to skip that page and migrate the rest to achieve a better compaction
result.

3. Another possible optimization is to enable free page merge. It is
possible that a to-be-migrated page causes a free page split and then
eventually fails to migrate. Without a free page merge function, we would
lose a high order free page. But reusing the existing PFN-based buddy page
merge algorithm requires a way of identifying free pages held for memory
compaction.

4. The implemented >0 order folio compaction algorithm is quite naive
and does not consider all possible situations. A better algorithm can
improve compaction success rate.


Feel free to give comments and ask questions.

Thanks.


[1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/

Zi Yan (4):
  mm/compaction: add support for >0 order folio memory compaction.
  mm/compaction: optimize >0 order folio compaction with free page
    split.
  mm/compaction: optimize >0 order folio compaction by sorting source
    pages.
  mm/compaction: enable compacting >0 order folios.

 mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
 mm/internal.h   |   7 +-
 2 files changed, 176 insertions(+), 36 deletions(-)

-- 
2.40.1



* [RFC PATCH 1/4] mm/compaction: add support for >0 order folio memory compaction.
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
@ 2023-09-12 16:28 ` Zi Yan
  2023-09-12 17:32   ` Johannes Weiner
                     ` (2 more replies)
  2023-09-12 16:28 ` [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split Zi Yan
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-12 16:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Zi Yan, Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

From: Zi Yan <ziy@nvidia.com>

Before, memory compaction only migrates order-0 folios and skips >0 order
folios. This commit adds support for >0 order folio compaction by keeping
isolated free pages at their original size, without splitting them into
order-0 pages, and using them directly during the migration process.

What is different from the prior implementation:
1. All isolated free pages are kept in a MAX_ORDER+1 array of page lists,
   where each page list stores free pages of the same order.
2. Free pages are neither post_alloc_hook() processed nor buddy pages,
   although their orders are stored in the first page's private field like
   buddy pages.
3. During migration, at new page allocation time (i.e., in
   compaction_alloc()), free pages are processed by post_alloc_hook().
   When migration fails and a page is returned (i.e., in
   compaction_free()), the free page is restored by reversing the
   post_alloc_hook() operations.

Step 3 is done so that a later optimization, splitting and/or merging free
pages during compaction, becomes easier.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/compaction.c | 108 +++++++++++++++++++++++++++++++++++++++---------
 mm/internal.h   |   7 +++-
 2 files changed, 94 insertions(+), 21 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 01ba298739dd..868e92e55d27 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -107,6 +107,44 @@ static void split_map_pages(struct list_head *list)
 	list_splice(&tmp_list, list);
 }
 
+static unsigned long release_free_list(struct free_list *freepages)
+{
+	int order;
+	unsigned long high_pfn = 0;
+
+	for (order = 0; order <= MAX_ORDER; order++) {
+		struct page *page, *next;
+
+		list_for_each_entry_safe(page, next, &freepages[order].pages, lru) {
+			unsigned long pfn = page_to_pfn(page);
+
+			list_del(&page->lru);
+			/*
+			 * Convert free pages into post allocation pages, so
+			 * that we can free them via __free_page.
+			 */
+			post_alloc_hook(page, order, __GFP_MOVABLE);
+			__free_pages(page, order);
+			if (pfn > high_pfn)
+				high_pfn = pfn;
+		}
+	}
+	return high_pfn;
+}
+
+static void sort_free_pages(struct list_head *src, struct free_list *dst)
+{
+	unsigned int order;
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, src, lru) {
+		order = buddy_order(page);
+
+		list_move(&page->lru, &dst[order].pages);
+		dst[order].nr_free++;
+	}
+}
+
 #ifdef CONFIG_COMPACTION
 bool PageMovable(struct page *page)
 {
@@ -1422,6 +1460,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
 {
 	unsigned long start_pfn, end_pfn;
 	struct page *page;
+	LIST_HEAD(freelist);
 
 	/* Do not search around if there are enough pages already */
 	if (cc->nr_freepages >= cc->nr_migratepages)
@@ -1439,7 +1478,8 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
 	if (!page)
 		return;
 
-	isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
+	isolate_freepages_block(cc, &start_pfn, end_pfn, &freelist, 1, false);
+	sort_free_pages(&freelist, cc->freepages);
 
 	/* Skip this pageblock in the future as it's full or nearly full */
 	if (start_pfn == end_pfn && !cc->no_set_skip_hint)
@@ -1568,7 +1608,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
 				nr_scanned += nr_isolated - 1;
 				total_isolated += nr_isolated;
 				cc->nr_freepages += nr_isolated;
-				list_add_tail(&page->lru, &cc->freepages);
+				list_add_tail(&page->lru, &cc->freepages[order].pages);
 				count_compact_events(COMPACTISOLATED, nr_isolated);
 			} else {
 				/* If isolation fails, abort the search */
@@ -1642,13 +1682,13 @@ static void isolate_freepages(struct compact_control *cc)
 	unsigned long isolate_start_pfn; /* exact pfn we start at */
 	unsigned long block_end_pfn;	/* end of current pageblock */
 	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
-	struct list_head *freelist = &cc->freepages;
 	unsigned int stride;
+	LIST_HEAD(freelist);
 
 	/* Try a small search of the free lists for a candidate */
 	fast_isolate_freepages(cc);
 	if (cc->nr_freepages)
-		goto splitmap;
+		return;
 
 	/*
 	 * Initialise the free scanner. The starting point is where we last
@@ -1708,7 +1748,8 @@ static void isolate_freepages(struct compact_control *cc)
 
 		/* Found a block suitable for isolating free pages from. */
 		nr_isolated = isolate_freepages_block(cc, &isolate_start_pfn,
-					block_end_pfn, freelist, stride, false);
+					block_end_pfn, &freelist, stride, false);
+		sort_free_pages(&freelist, cc->freepages);
 
 		/* Update the skip hint if the full pageblock was scanned */
 		if (isolate_start_pfn == block_end_pfn)
@@ -1749,10 +1790,6 @@ static void isolate_freepages(struct compact_control *cc)
 	 * and the loop terminated due to isolate_start_pfn < low_pfn
 	 */
 	cc->free_pfn = isolate_start_pfn;
-
-splitmap:
-	/* __isolate_free_page() does not map the pages */
-	split_map_pages(freelist);
 }
 
 /*
@@ -1763,18 +1800,21 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
 {
 	struct compact_control *cc = (struct compact_control *)data;
 	struct folio *dst;
+	int order = folio_order(src);
 
-	if (list_empty(&cc->freepages)) {
+	if (!cc->freepages[order].nr_free) {
 		isolate_freepages(cc);
-
-		if (list_empty(&cc->freepages))
+		if (!cc->freepages[order].nr_free)
 			return NULL;
 	}
 
-	dst = list_entry(cc->freepages.next, struct folio, lru);
+	dst = list_first_entry(&cc->freepages[order].pages, struct folio, lru);
+	cc->freepages[order].nr_free--;
 	list_del(&dst->lru);
-	cc->nr_freepages--;
-
+	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
+	if (order)
+		prep_compound_page(&dst->page, order);
+	cc->nr_freepages -= 1 << order;
 	return dst;
 }
 
@@ -1786,9 +1826,34 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
 static void compaction_free(struct folio *dst, unsigned long data)
 {
 	struct compact_control *cc = (struct compact_control *)data;
+	int order = folio_order(dst);
+	struct page *page = &dst->page;
 
-	list_add(&dst->lru, &cc->freepages);
-	cc->nr_freepages++;
+	if (order) {
+		int i;
+
+		page[1].flags &= ~PAGE_FLAGS_SECOND;
+		for (i = 1; i < (1 << order); i++) {
+			page[i].mapping = NULL;
+			clear_compound_head(&page[i]);
+			page[i].flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		}
+
+	}
+	/* revert post_alloc_hook() operations */
+	page->mapping = NULL;
+	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	set_page_count(page, 0);
+	page_mapcount_reset(page);
+	reset_page_owner(page, order);
+	page_table_check_free(page, order);
+	arch_free_page(page, order);
+	set_page_private(page, order);
+	INIT_LIST_HEAD(&dst->lru);
+
+	list_add(&dst->lru, &cc->freepages[order].pages);
+	cc->freepages[order].nr_free++;
+	cc->nr_freepages += 1 << order;
 }
 
 /* possible outcome of isolate_migratepages */
@@ -2412,6 +2477,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	const bool sync = cc->mode != MIGRATE_ASYNC;
 	bool update_cached;
 	unsigned int nr_succeeded = 0;
+	int order;
 
 	/*
 	 * These counters track activities during zone compaction.  Initialize
@@ -2421,7 +2487,10 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	cc->total_free_scanned = 0;
 	cc->nr_migratepages = 0;
 	cc->nr_freepages = 0;
-	INIT_LIST_HEAD(&cc->freepages);
+	for (order = 0; order <= MAX_ORDER; order++) {
+		INIT_LIST_HEAD(&cc->freepages[order].pages);
+		cc->freepages[order].nr_free = 0;
+	}
 	INIT_LIST_HEAD(&cc->migratepages);
 
 	cc->migratetype = gfp_migratetype(cc->gfp_mask);
@@ -2607,7 +2676,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	 * so we don't leave any returned pages behind in the next attempt.
 	 */
 	if (cc->nr_freepages > 0) {
-		unsigned long free_pfn = release_freepages(&cc->freepages);
+		unsigned long free_pfn = release_free_list(cc->freepages);
 
 		cc->nr_freepages = 0;
 		VM_BUG_ON(free_pfn == 0);
@@ -2626,7 +2695,6 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 
 	trace_mm_compaction_end(cc, start_pfn, end_pfn, sync, ret);
 
-	VM_BUG_ON(!list_empty(&cc->freepages));
 	VM_BUG_ON(!list_empty(&cc->migratepages));
 
 	return ret;
diff --git a/mm/internal.h b/mm/internal.h
index 8c90e966e9f8..f5c691bb5c1c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -465,6 +465,11 @@ int split_free_page(struct page *free_page,
 /*
  * in mm/compaction.c
  */
+
+struct free_list {
+	struct list_head	pages;
+	unsigned long		nr_free;
+};
 /*
  * compact_control is used to track pages being migrated and the free pages
  * they are being migrated to during memory compaction. The free_pfn starts
@@ -473,7 +478,7 @@ int split_free_page(struct page *free_page,
  * completes when free_pfn <= migrate_pfn
  */
 struct compact_control {
-	struct list_head freepages;	/* List of free pages to migrate to */
+	struct free_list freepages[MAX_ORDER + 1];	/* List of free pages to migrate to */
 	struct list_head migratepages;	/* List of pages being migrated */
 	unsigned int nr_freepages;	/* Number of isolated free pages */
 	unsigned int nr_migratepages;	/* Number of pages to migrate */
-- 
2.40.1



* [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
  2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
@ 2023-09-12 16:28 ` Zi Yan
  2023-09-18  7:34   ` Baolin Wang
  2023-09-12 16:28 ` [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages Zi Yan
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2023-09-12 16:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Zi Yan, Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

From: Zi Yan <ziy@nvidia.com>

During migration in memory compaction, free pages are placed in an array
of page lists based on their order. But the desired free page order (i.e.,
the order of a source page) might not always be present, thus leading to
migration failures. Split a high order free page when the source migration
page has a lower order to increase the migration success rate.

Note: merging free pages when a migration fails and a lower order free
page is returned via compaction_free() is possible, but it requires too
much work. Since the free pages are not buddy pages, it is hard to identify
them using the existing PFN-based page merging algorithm.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 868e92e55d27..45747ab5f380 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
 	struct compact_control *cc = (struct compact_control *)data;
 	struct folio *dst;
 	int order = folio_order(src);
+	bool has_isolated_pages = false;
 
+again:
 	if (!cc->freepages[order].nr_free) {
-		isolate_freepages(cc);
+		int i;
+
+		for (i = order + 1; i <= MAX_ORDER; i++) {
+			if (cc->freepages[i].nr_free) {
+				struct page *freepage =
+					list_first_entry(&cc->freepages[i].pages,
+							 struct page, lru);
+
+				int start_order = i;
+				unsigned long size = 1 << start_order;
+
+				list_del(&freepage->lru);
+				cc->freepages[i].nr_free--;
+
+				while (start_order > order) {
+					start_order--;
+					size >>= 1;
+
+					list_add(&freepage[size].lru,
+						&cc->freepages[start_order].pages);
+					cc->freepages[start_order].nr_free++;
+					set_page_private(&freepage[size], start_order);
+				}
+				post_alloc_hook(freepage, order, __GFP_MOVABLE);
+				if (order)
+					prep_compound_page(freepage, order);
+				dst = page_folio(freepage);
+				goto done;
+			}
+		}
+		if (!has_isolated_pages) {
+			isolate_freepages(cc);
+			has_isolated_pages = true;
+			goto again;
+		}
+
 		if (!cc->freepages[order].nr_free)
 			return NULL;
 	}
@@ -1814,6 +1851,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
 	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
 	if (order)
 		prep_compound_page(&dst->page, order);
+done:
 	cc->nr_freepages -= 1 << order;
 	return dst;
 }
-- 
2.40.1



* [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages.
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
  2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
  2023-09-12 16:28 ` [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split Zi Yan
@ 2023-09-12 16:28 ` Zi Yan
  2023-09-12 17:56   ` Johannes Weiner
  2023-09-12 16:28 ` [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios Zi Yan
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2023-09-12 16:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Zi Yan, Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

From: Zi Yan <ziy@nvidia.com>

It should maximize high order free page use and minimize free page splits.
It might be useful before free page merging is implemented.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/compaction.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 45747ab5f380..4300d877b824 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -145,6 +145,38 @@ static void sort_free_pages(struct list_head *src, struct free_list *dst)
 	}
 }
 
+static void sort_folios_by_order(struct list_head *pages)
+{
+	struct free_list page_list[MAX_ORDER + 1];
+	int order;
+	struct folio *folio, *next;
+
+	for (order = 0; order <= MAX_ORDER; order++) {
+		INIT_LIST_HEAD(&page_list[order].pages);
+		page_list[order].nr_free = 0;
+	}
+
+	list_for_each_entry_safe(folio, next, pages, lru) {
+		order = folio_order(folio);
+
+		if (order > MAX_ORDER)
+			continue;
+
+		list_move(&folio->lru, &page_list[order].pages);
+		page_list[order].nr_free++;
+	}
+
+	for (order = MAX_ORDER; order >= 0; order--) {
+		if (page_list[order].nr_free) {
+
+			list_for_each_entry_safe(folio, next,
+						 &page_list[order].pages, lru) {
+				list_move_tail(&folio->lru, pages);
+			}
+		}
+	}
+}
+
 #ifdef CONFIG_COMPACTION
 bool PageMovable(struct page *page)
 {
@@ -2636,6 +2668,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 				pageblock_start_pfn(cc->migrate_pfn - 1));
 		}
 
+		sort_folios_by_order(&cc->migratepages);
+
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
 				MR_COMPACTION, &nr_succeeded);
-- 
2.40.1



* [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
                   ` (2 preceding siblings ...)
  2023-09-12 16:28 ` [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages Zi Yan
@ 2023-09-12 16:28 ` Zi Yan
  2023-09-15  9:41   ` Baolin Wang
  2023-09-20 14:44   ` kernel test robot
  2023-09-21  0:55 ` [RFC PATCH 0/4] Enable >0 order folio memory compaction Luis Chamberlain
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-12 16:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Zi Yan, Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

From: Zi Yan <ziy@nvidia.com>

Since the compaction code can now compact >0 order folios, enable
isolating and migrating them during the compaction process.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/compaction.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4300d877b824..f72af74094de 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1087,11 +1087,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (PageCompound(page) && !cc->alloc_contig) {
 			const unsigned int order = compound_order(page);
 
-			if (likely(order <= MAX_ORDER)) {
-				low_pfn += (1UL << order) - 1;
-				nr_scanned += (1UL << order) - 1;
+			/*
+			 * Compacting > pageblock_order pages does not improve
+			 * memory fragmentation. Also skip hugetlbfs pages.
+			 */
+			if (likely(order >= pageblock_order) || PageHuge(page)) {
+				if (order <= MAX_ORDER) {
+					low_pfn += (1UL << order) - 1;
+					nr_scanned += (1UL << order) - 1;
+				}
+				goto isolate_fail;
 			}
-			goto isolate_fail;
 		}
 
 		/*
@@ -1214,17 +1220,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 					goto isolate_abort;
 				}
 			}
-
-			/*
-			 * folio become large since the non-locked check,
-			 * and it's on LRU.
-			 */
-			if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
-				low_pfn += folio_nr_pages(folio) - 1;
-				nr_scanned += folio_nr_pages(folio) - 1;
-				folio_set_lru(folio);
-				goto isolate_fail_put;
-			}
 		}
 
 		/* The folio is taken off the LRU */
-- 
2.40.1



* Re: [RFC PATCH 1/4] mm/compaction: add support for >0 order folio memory compaction.
  2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
@ 2023-09-12 17:32   ` Johannes Weiner
  2023-09-12 17:38     ` Zi Yan
  2023-09-15  9:33   ` Baolin Wang
  2023-10-10  8:07   ` Huang, Ying
  2 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2023-09-12 17:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Zi Yan, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

On Tue, Sep 12, 2023 at 12:28:12PM -0400, Zi Yan wrote:
> @@ -1439,7 +1478,8 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>  	if (!page)
>  		return;
>  
> -	isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
> +	isolate_freepages_block(cc, &start_pfn, end_pfn, &freelist, 1, false);
> +	sort_free_pages(&freelist, cc->freepages);

Can you make isolate_freepages_block() put the pages directly into a
sorted struct free_list?

AFAICS, the only place that doesn't technically need it is
isolate_freepages_range(). But that's then also the sole caller of
split_map_pages(), which can be made to work on struct free_list too
without notable overhead.
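
Roughly something like this (untested, just to illustrate the idea; the
helper name is made up):

static void free_list_add(struct page *page, struct free_list *freepages)
{
	int order = page_private(page);

	list_add_tail(&page->lru, &freepages[order].pages);
	freepages[order].nr_free++;
}

i.e. isolate_freepages_block() would take a struct free_list * and call
that, so callers wouldn't need a separate sort_free_pages() pass.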


* Re: [RFC PATCH 1/4] mm/compaction: add support for >0 order folio memory compaction.
  2023-09-12 17:32   ` Johannes Weiner
@ 2023-09-12 17:38     ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-12 17:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Baolin Wang, Kemeng Shi, Mel Gorman,
	Rohan Puri, Mcgrof Chamberlain, Adam Manzanares, John Hubbard


On 12 Sep 2023, at 13:32, Johannes Weiner wrote:

> On Tue, Sep 12, 2023 at 12:28:12PM -0400, Zi Yan wrote:
>> @@ -1439,7 +1478,8 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>>  	if (!page)
>>  		return;
>>
>> -	isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
>> +	isolate_freepages_block(cc, &start_pfn, end_pfn, &freelist, 1, false);
>> +	sort_free_pages(&freelist, cc->freepages);
>
> Can you make isolate_freepages_block() put the pages directly into a
> sorted struct free_list?
>
> AFAICS, the only place that doesn't technically need it is
> isolate_freepages_range(). But that's then also the sole caller of
> split_map_pages(), which can be made to work on struct free_list too
> without notable overhead.

Sure. Will do that in the next version.

--
Best Regards,
Yan, Zi



* Re: [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages.
  2023-09-12 16:28 ` [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages Zi Yan
@ 2023-09-12 17:56   ` Johannes Weiner
  2023-09-12 20:31     ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2023-09-12 17:56 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Zi Yan, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

On Tue, Sep 12, 2023 at 12:28:14PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It should maximize high order free page use and minimize free page splits.
> It might be useful before free page merging is implemented.

The premise sounds reasonable to me: start with the largest chunks in
the hope of producing the desired block size before having to piece
things together from the order-0s dribbles.

> @@ -145,6 +145,38 @@ static void sort_free_pages(struct list_head *src, struct free_list *dst)
>  	}
>  }
>  
> +static void sort_folios_by_order(struct list_head *pages)
> +{
> +	struct free_list page_list[MAX_ORDER + 1];
> +	int order;
> +	struct folio *folio, *next;
> +
> +	for (order = 0; order <= MAX_ORDER; order++) {
> +		INIT_LIST_HEAD(&page_list[order].pages);
> +		page_list[order].nr_free = 0;
> +	}
> +
> +	list_for_each_entry_safe(folio, next, pages, lru) {
> +		order = folio_order(folio);
> +
> +		if (order > MAX_ORDER)
> +			continue;
> +
> +		list_move(&folio->lru, &page_list[order].pages);
> +		page_list[order].nr_free++;
> +	}
> +
> +	for (order = MAX_ORDER; order >= 0; order--) {
> +		if (page_list[order].nr_free) {
> +
> +			list_for_each_entry_safe(folio, next,
> +						 &page_list[order].pages, lru) {
> +				list_move_tail(&folio->lru, pages);
> +			}
> +		}
> +	}
> +}
> +
>  #ifdef CONFIG_COMPACTION
>  bool PageMovable(struct page *page)
>  {
> @@ -2636,6 +2668,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  				pageblock_start_pfn(cc->migrate_pfn - 1));
>  		}
>  
> +		sort_folios_by_order(&cc->migratepages);

Would it make sense to have isolate_migratepages_block() produce a
sorted list already? By collecting into a struct free_list in there
and finishing with that `for (order = MAX...) list_add_tail()' loop.

That would save quite a bit of shuffling around. Compaction can be
hot, and is expected to get hotter with growing larger order pressure.

The contig allocator doesn't care about ordering, but it should be
possible to gate the sorting reasonably on !cc->alloc_contig.

An optimization down the line could be to skip the sorted list
assembly for the compaction case entirely, have compact_zone() work
directly on struct free_list, starting with the highest order and
checking compact_finished() in between orders.
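
Pseudo-code of what I mean (rough sketch only; the per-order source list
field migrate_src is made up):

	for (order = MAX_ORDER; order >= 0; order--) {
		if (list_empty(&cc->migrate_src[order].pages))
			continue;

		err = migrate_pages(&cc->migrate_src[order].pages,
				    compaction_alloc, compaction_free,
				    (unsigned long)cc, cc->mode,
				    MR_COMPACTION, &nr_succeeded);

		if (compact_finished(cc) != COMPACT_CONTINUE)
			break;
	}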


* Re: [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages.
  2023-09-12 17:56   ` Johannes Weiner
@ 2023-09-12 20:31     ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-12 20:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Baolin Wang, Kemeng Shi, Mel Gorman,
	Rohan Puri, Mcgrof Chamberlain, Adam Manzanares, John Hubbard


On 12 Sep 2023, at 13:56, Johannes Weiner wrote:

> On Tue, Sep 12, 2023 at 12:28:14PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It should maximize high order free page use and minimize free page splits.
>> It might be useful before free page merging is implemented.
>
> The premise sounds reasonable to me: start with the largest chunks in
> the hope of producing the desired block size before having to piece
> things together from the order-0s dribbles.
>
>> @@ -145,6 +145,38 @@ static void sort_free_pages(struct list_head *src, struct free_list *dst)
>>  	}
>>  }
>>
>> +static void sort_folios_by_order(struct list_head *pages)
>> +{
>> +	struct free_list page_list[MAX_ORDER + 1];
>> +	int order;
>> +	struct folio *folio, *next;
>> +
>> +	for (order = 0; order <= MAX_ORDER; order++) {
>> +		INIT_LIST_HEAD(&page_list[order].pages);
>> +		page_list[order].nr_free = 0;
>> +	}
>> +
>> +	list_for_each_entry_safe(folio, next, pages, lru) {
>> +		order = folio_order(folio);
>> +
>> +		if (order > MAX_ORDER)
>> +			continue;
>> +
>> +		list_move(&folio->lru, &page_list[order].pages);
>> +		page_list[order].nr_free++;
>> +	}
>> +
>> +	for (order = MAX_ORDER; order >= 0; order--) {
>> +		if (page_list[order].nr_free) {
>> +
>> +			list_for_each_entry_safe(folio, next,
>> +						 &page_list[order].pages, lru) {
>> +				list_move_tail(&folio->lru, pages);
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>  #ifdef CONFIG_COMPACTION
>>  bool PageMovable(struct page *page)
>>  {
>> @@ -2636,6 +2668,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>  				pageblock_start_pfn(cc->migrate_pfn - 1));
>>  		}
>>
>> +		sort_folios_by_order(&cc->migratepages);
>
> Would it make sense to have isolate_migratepages_block() produce a
> sorted list already? By collecting into a struct free_list in there
> and finishing with that `for (order = MAX...) list_add_tail()' loop.
>
> That would save quite a bit of shuffling around. Compaction can be
> hot, and is expected to get hotter with growing larger order pressure.

Yes, that sounds reasonable. Will do that in the next version.

>
> The contig allocator doesn't care about ordering, but it should be
> possible to gate the sorting reasonably on !cc->alloc_contig.

Right. For !cc->alloc_contig, pages are put in struct free_list and
later sorted and moved to cc->migratepages. For cc->alloc_contig,
pages are directly put on cc->migratepages.

>
> An optimization down the line could be to skip the sorted list
> assembly for the compaction case entirely, have compact_zone() work
> directly on struct free_list, starting with the highest order and
> checking compact_finished() in between orders.

Sounds reasonable. It actually makes me think more and realize that sorting
source pages might not be optimal all the time. In general, migrating higher
order folios first would generate larger free spaces, which might meet the
compaction goal faster. But in some cases a free page of the target order,
e.g., order 4, can be generated by migrating one order-3 page and 8 other
order-0 pages around it. This means we might waste effort by migrating all
high order pages first. I guess there will be a lot of possible
optimizations for different situations. :)


--
Best Regards,
Yan, Zi



* Re: [RFC PATCH 1/4] mm/compaction: add support for >0 order folio memory compaction.
  2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
  2023-09-12 17:32   ` Johannes Weiner
@ 2023-09-15  9:33   ` Baolin Wang
  2023-09-18 17:06     ` Zi Yan
  2023-10-10  8:07   ` Huang, Ying
  2 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2023-09-15  9:33 UTC (permalink / raw)
  To: Zi Yan, linux-mm, linux-kernel
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard



On 9/13/2023 12:28 AM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Before, memory compaction only migrates order-0 folios and skips >0 order
> folios. This commit adds support for >0 order folio compaction by keeping
> isolated free pages at their original size, without splitting them into
> order-0 pages, and using them directly during the migration process.
> 
> What is different from the prior implementation:
> 1. All isolated free pages are kept in a MAX_ORDER+1 array of page lists,
>     where each page list stores free pages of the same order.
> 2. Free pages are neither post_alloc_hook() processed nor buddy pages,
>     although their orders are stored in the first page's private field like
>     buddy pages.
> 3. During migration, at new page allocation time (i.e., in
>     compaction_alloc()), free pages are processed by post_alloc_hook().
>     When migration fails and a page is returned (i.e., in
>     compaction_free()), the free page is restored by reversing the
>     post_alloc_hook() operations.
> 
> Step 3 is done so that a later optimization, splitting and/or merging free
> pages during compaction, becomes easier.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/compaction.c | 108 +++++++++++++++++++++++++++++++++++++++---------
>   mm/internal.h   |   7 +++-
>   2 files changed, 94 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 01ba298739dd..868e92e55d27 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -107,6 +107,44 @@ static void split_map_pages(struct list_head *list)
>   	list_splice(&tmp_list, list);
>   }
>   
> +static unsigned long release_free_list(struct free_list *freepages)
> +{
> +	int order;
> +	unsigned long high_pfn = 0;
> +
> +	for (order = 0; order <= MAX_ORDER; order++) {
> +		struct page *page, *next;
> +
> +		list_for_each_entry_safe(page, next, &freepages[order].pages, lru) {
> +			unsigned long pfn = page_to_pfn(page);
> +
> +			list_del(&page->lru);
> +			/*
> +			 * Convert free pages into post allocation pages, so
> +			 * that we can free them via __free_page.
> +			 */
> +			post_alloc_hook(page, order, __GFP_MOVABLE);
> +			__free_pages(page, order);
> +			if (pfn > high_pfn)
> +				high_pfn = pfn;
> +		}
> +	}
> +	return high_pfn;
> +}
> +
> +static void sort_free_pages(struct list_head *src, struct free_list *dst)
> +{
> +	unsigned int order;
> +	struct page *page, *next;
> +
> +	list_for_each_entry_safe(page, next, src, lru) {
> +		order = buddy_order(page);

These pages are already isolated from the buddy system, but continue to
use buddy_order() to get the page order, which can be confusing.
Moreover, buddy_order() should only be used under zone lock protection.

IMO, just use 'page_private()' to get the page order like
split_map_pages() already does, which seems more readable.
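
I.e., something like the below (untested):

static void sort_free_pages(struct list_head *src, struct free_list *dst)
{
	unsigned int order;
	struct page *page, *next;

	list_for_each_entry_safe(page, next, src, lru) {
		/* isolated free pages carry their order in page->private */
		order = page_private(page);

		list_move(&page->lru, &dst[order].pages);
		dst[order].nr_free++;
	}
}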

> +
> +		list_move(&page->lru, &dst[order].pages);
> +		dst[order].nr_free++;
> +	}
> +}
> +
>   #ifdef CONFIG_COMPACTION
>   bool PageMovable(struct page *page)
>   {
> @@ -1422,6 +1460,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>   {
>   	unsigned long start_pfn, end_pfn;
>   	struct page *page;
> +	LIST_HEAD(freelist);
>   
>   	/* Do not search around if there are enough pages already */
>   	if (cc->nr_freepages >= cc->nr_migratepages)
> @@ -1439,7 +1478,8 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>   	if (!page)
>   		return;
>   
> -	isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
> +	isolate_freepages_block(cc, &start_pfn, end_pfn, &freelist, 1, false);
> +	sort_free_pages(&freelist, cc->freepages);
>   
>   	/* Skip this pageblock in the future as it's full or nearly full */
>   	if (start_pfn == end_pfn && !cc->no_set_skip_hint)
> @@ -1568,7 +1608,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
>   				nr_scanned += nr_isolated - 1;
>   				total_isolated += nr_isolated;
>   				cc->nr_freepages += nr_isolated;
> -				list_add_tail(&page->lru, &cc->freepages);
> +				list_add_tail(&page->lru, &cc->freepages[order].pages);

Missed updating cc->freepages[order].nr_free?

>   				count_compact_events(COMPACTISOLATED, nr_isolated);
>   			} else {
>   				/* If isolation fails, abort the search */
> @@ -1642,13 +1682,13 @@ static void isolate_freepages(struct compact_control *cc)
>   	unsigned long isolate_start_pfn; /* exact pfn we start at */
>   	unsigned long block_end_pfn;	/* end of current pageblock */
>   	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
> -	struct list_head *freelist = &cc->freepages;
>   	unsigned int stride;
> +	LIST_HEAD(freelist);
>   
>   	/* Try a small search of the free lists for a candidate */
>   	fast_isolate_freepages(cc);
>   	if (cc->nr_freepages)
> -		goto splitmap;
> +		return;
>   
>   	/*
>   	 * Initialise the free scanner. The starting point is where we last
> @@ -1708,7 +1748,8 @@ static void isolate_freepages(struct compact_control *cc)
>   
>   		/* Found a block suitable for isolating free pages from. */
>   		nr_isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> -					block_end_pfn, freelist, stride, false);
> +					block_end_pfn, &freelist, stride, false);
> +		sort_free_pages(&freelist, cc->freepages);
>   
>   		/* Update the skip hint if the full pageblock was scanned */
>   		if (isolate_start_pfn == block_end_pfn)
> @@ -1749,10 +1790,6 @@ static void isolate_freepages(struct compact_control *cc)
>   	 * and the loop terminated due to isolate_start_pfn < low_pfn
>   	 */
>   	cc->free_pfn = isolate_start_pfn;
> -
> -splitmap:
> -	/* __isolate_free_page() does not map the pages */
> -	split_map_pages(freelist);
>   }
>   
>   /*
> @@ -1763,18 +1800,21 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>   {
>   	struct compact_control *cc = (struct compact_control *)data;
>   	struct folio *dst;
> +	int order = folio_order(src);
>   
> -	if (list_empty(&cc->freepages)) {
> +	if (!cc->freepages[order].nr_free) {
>   		isolate_freepages(cc);
> -
> -		if (list_empty(&cc->freepages))
> +		if (!cc->freepages[order].nr_free)
>   			return NULL;
>   	}
>   
> -	dst = list_entry(cc->freepages.next, struct folio, lru);
> +	dst = list_first_entry(&cc->freepages[order].pages, struct folio, lru);
> +	cc->freepages[order].nr_free--;
>   	list_del(&dst->lru);
> -	cc->nr_freepages--;
> -
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> +	if (order)
> +		prep_compound_page(&dst->page, order);
> +	cc->nr_freepages -= 1 << order;
>   	return dst;
>   }
>   
> @@ -1786,9 +1826,34 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>   static void compaction_free(struct folio *dst, unsigned long data)
>   {
>   	struct compact_control *cc = (struct compact_control *)data;
> +	int order = folio_order(dst);
> +	struct page *page = &dst->page;
>   
> -	list_add(&dst->lru, &cc->freepages);
> -	cc->nr_freepages++;
> +	if (order) {
> +		int i;
> +
> +		page[1].flags &= ~PAGE_FLAGS_SECOND;
> +		for (i = 1; i < (1 << order); i++) {
> +			page[i].mapping = NULL;
> +			clear_compound_head(&page[i]);
> +			page[i].flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		}
> +
> +	}
> +	/* revert post_alloc_hook() operations */
> +	page->mapping = NULL;
> +	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +	set_page_count(page, 0);
> +	page_mapcount_reset(page);
> +	reset_page_owner(page, order);
> +	page_table_check_free(page, order);
> +	arch_free_page(page, order);
> +	set_page_private(page, order);
> +	INIT_LIST_HEAD(&dst->lru);
> +
> +	list_add(&dst->lru, &cc->freepages[order].pages);
> +	cc->freepages[order].nr_free++;
> +	cc->nr_freepages += 1 << order;
>   }
>   
>   /* possible outcome of isolate_migratepages */
> @@ -2412,6 +2477,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>   	const bool sync = cc->mode != MIGRATE_ASYNC;
>   	bool update_cached;
>   	unsigned int nr_succeeded = 0;
> +	int order;
>   
>   	/*
>   	 * These counters track activities during zone compaction.  Initialize
> @@ -2421,7 +2487,10 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>   	cc->total_free_scanned = 0;
>   	cc->nr_migratepages = 0;
>   	cc->nr_freepages = 0;
> -	INIT_LIST_HEAD(&cc->freepages);
> +	for (order = 0; order <= MAX_ORDER; order++) {
> +		INIT_LIST_HEAD(&cc->freepages[order].pages);
> +		cc->freepages[order].nr_free = 0;
> +	}
>   	INIT_LIST_HEAD(&cc->migratepages);
>   
>   	cc->migratetype = gfp_migratetype(cc->gfp_mask);
> @@ -2607,7 +2676,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>   	 * so we don't leave any returned pages behind in the next attempt.
>   	 */
>   	if (cc->nr_freepages > 0) {
> -		unsigned long free_pfn = release_freepages(&cc->freepages);
> +		unsigned long free_pfn = release_free_list(cc->freepages);
>   
>   		cc->nr_freepages = 0;
>   		VM_BUG_ON(free_pfn == 0);
> @@ -2626,7 +2695,6 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>   
>   	trace_mm_compaction_end(cc, start_pfn, end_pfn, sync, ret);
>   
> -	VM_BUG_ON(!list_empty(&cc->freepages));
>   	VM_BUG_ON(!list_empty(&cc->migratepages));
>   
>   	return ret;
> diff --git a/mm/internal.h b/mm/internal.h
> index 8c90e966e9f8..f5c691bb5c1c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -465,6 +465,11 @@ int split_free_page(struct page *free_page,
>   /*
>    * in mm/compaction.c
>    */
> +
> +struct free_list {
> +	struct list_head	pages;
> +	unsigned long		nr_free;
> +};
>   /*
>    * compact_control is used to track pages being migrated and the free pages
>    * they are being migrated to during memory compaction. The free_pfn starts
> @@ -473,7 +478,7 @@ int split_free_page(struct page *free_page,
>    * completes when free_pfn <= migrate_pfn
>    */
>   struct compact_control {
> -	struct list_head freepages;	/* List of free pages to migrate to */
> +	struct free_list freepages[MAX_ORDER + 1];	/* List of free pages to migrate to */
>   	struct list_head migratepages;	/* List of pages being migrated */
>   	unsigned int nr_freepages;	/* Number of isolated free pages */
>   	unsigned int nr_migratepages;	/* Number of pages to migrate */


* Re: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.
  2023-09-12 16:28 ` [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios Zi Yan
@ 2023-09-15  9:41   ` Baolin Wang
  2023-09-18 17:17     ` Zi Yan
  2023-09-20 14:44   ` kernel test robot
  1 sibling, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2023-09-15  9:41 UTC (permalink / raw)
  To: Zi Yan, linux-mm, linux-kernel
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard



On 9/13/2023 12:28 AM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Since the compaction code can now compact >0 order folios, enable
> isolating and migrating them during the compaction process.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/compaction.c | 25 ++++++++++---------------
>   1 file changed, 10 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4300d877b824..f72af74094de 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1087,11 +1087,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   		if (PageCompound(page) && !cc->alloc_contig) {
>   			const unsigned int order = compound_order(page);
>   
> -			if (likely(order <= MAX_ORDER)) {
> -				low_pfn += (1UL << order) - 1;
> -				nr_scanned += (1UL << order) - 1;
> +			/*
> +			 * Compacting > pageblock_order pages does not improve
> +			 * memory fragmentation. Also skip hugetlbfs pages.
> +			 */
> +			if (likely(order >= pageblock_order) || PageHuge(page)) {

IMO, if the compound page order is larger than the requested cc->order,
we should also fail the isolation, because it also does not improve
fragmentation, right?

> +				if (order <= MAX_ORDER) {
> +					low_pfn += (1UL << order) - 1;
> +					nr_scanned += (1UL << order) - 1;
> +				}
> +				goto isolate_fail;
>   			}
> -			goto isolate_fail;
>   		}
>   
>   		/*
> @@ -1214,17 +1220,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   					goto isolate_abort;
>   				}
>   			}
> -
> -			/*
> -			 * folio become large since the non-locked check,
> -			 * and it's on LRU.
> -			 */
> -			if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
> -				low_pfn += folio_nr_pages(folio) - 1;
> -				nr_scanned += folio_nr_pages(folio) - 1;
> -				folio_set_lru(folio);
> -				goto isolate_fail_put;
> -			}

I do not think you can remove this validation, since the previous
validation is lockless. So under the lock, we need to re-check whether the
compound page order is larger than pageblock_order or cc->order, and fail
the isolation if so.

>   		}
>   
>   		/* The folio is taken off the LRU */


* Re: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.
  2023-09-12 16:28 ` [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split Zi Yan
@ 2023-09-18  7:34   ` Baolin Wang
  2023-09-18 17:20     ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2023-09-18  7:34 UTC (permalink / raw)
  To: Zi Yan, linux-mm, linux-kernel
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard



On 9/13/2023 12:28 AM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> During migration in memory compaction, free pages are placed in an array
> of page lists based on their order. But the desired free page order (i.e.,
> the order of a source page) might not always be present, thus leading to
> migration failures. Split a high order free page when the source migration
> page has a lower order to increase the migration success rate.
> 
> Note: merging free pages when a migration fails and a lower order free
> page is returned via compaction_free() is possible, but it requires too
> much work. Since the free pages are not buddy pages, it is hard to identify
> them using the existing PFN-based page merging algorithm.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 868e92e55d27..45747ab5f380 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>   	struct compact_control *cc = (struct compact_control *)data;
>   	struct folio *dst;
>   	int order = folio_order(src);
> +	bool has_isolated_pages = false;
>   
> +again:
>   	if (!cc->freepages[order].nr_free) {
> -		isolate_freepages(cc);
> +		int i;
> +
> +		for (i = order + 1; i <= MAX_ORDER; i++) {
> +			if (cc->freepages[i].nr_free) {
> +				struct page *freepage =
> +					list_first_entry(&cc->freepages[i].pages,
> +							 struct page, lru);
> +
> +				int start_order = i;
> +				unsigned long size = 1 << start_order;
> +
> +				list_del(&freepage->lru);
> +				cc->freepages[i].nr_free--;
> +
> +				while (start_order > order) {
> +					start_order--;
> +					size >>= 1;
> +
> +					list_add(&freepage[size].lru,
> +						&cc->freepages[start_order].pages);
> +					cc->freepages[start_order].nr_free++;
> +					set_page_private(&freepage[size], start_order);

IIUC, these split pages should also be initialized by calling the
appropriate functions, e.g. prep_compound_page()?

> +				}
> +				post_alloc_hook(freepage, order, __GFP_MOVABLE);
> +				if (order)
> +					prep_compound_page(freepage, order);
> +				dst = page_folio(freepage);
> +				goto done;
> +			}
> +		}
> +		if (!has_isolated_pages) {
> +			isolate_freepages(cc);
> +			has_isolated_pages = true;
> +			goto again;
> +		}
> +
>   		if (!cc->freepages[order].nr_free)
>   			return NULL;
>   	}
> @@ -1814,6 +1851,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>   	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
>   	if (order)
>   		prep_compound_page(&dst->page, order);
> +done:
>   	cc->nr_freepages -= 1 << order;
>   	return dst;
>   }


* Re: [RFC PATCH 1/4] mm/compaction: add support for >0 order folio memory compaction.
  2023-09-15  9:33   ` Baolin Wang
@ 2023-09-18 17:06     ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-18 17:06 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Kemeng Shi,
	Mel Gorman, Rohan Puri, Mcgrof Chamberlain, Adam Manzanares,
	John Hubbard


On 15 Sep 2023, at 5:33, Baolin Wang wrote:

> On 9/13/2023 12:28 AM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Before, memory compaction only migrates order-0 folios and skips >0 order
>> folios. This commit adds support for >0 order folio compaction by keeping
>> isolated free pages at their original size, without splitting them into
>> order-0 pages, and using them directly during the migration process.
>>
>> What is different from the prior implementation:
>> 1. All isolated free pages are kept in a MAX_ORDER+1 array of page lists,
>>     where each page list stores free pages of the same order.
>> 2. Free pages are neither post_alloc_hook() processed nor buddy pages,
>>     although their orders are stored in the first page's private field like
>>     buddy pages.
>> 3. During migration, at new page allocation time (i.e., in
>>     compaction_alloc()), free pages are processed by post_alloc_hook().
>>     When migration fails and a page is returned (i.e., in
>>     compaction_free()), the free page is restored by reversing the
>>     post_alloc_hook() operations.
>>
>> Step 3 is done so that a later optimization, splitting and/or merging free
>> pages during compaction, becomes easier.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/compaction.c | 108 +++++++++++++++++++++++++++++++++++++++---------
>>   mm/internal.h   |   7 +++-
>>   2 files changed, 94 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 01ba298739dd..868e92e55d27 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -107,6 +107,44 @@ static void split_map_pages(struct list_head *list)
>>   	list_splice(&tmp_list, list);
>>   }
>>  +static unsigned long release_free_list(struct free_list *freepages)
>> +{
>> +	int order;
>> +	unsigned long high_pfn = 0;
>> +
>> +	for (order = 0; order <= MAX_ORDER; order++) {
>> +		struct page *page, *next;
>> +
>> +		list_for_each_entry_safe(page, next, &freepages[order].pages, lru) {
>> +			unsigned long pfn = page_to_pfn(page);
>> +
>> +			list_del(&page->lru);
>> +			/*
>> +			 * Convert free pages into post allocation pages, so
>> +			 * that we can free them via __free_page.
>> +			 */
>> +			post_alloc_hook(page, order, __GFP_MOVABLE);
>> +			__free_pages(page, order);
>> +			if (pfn > high_pfn)
>> +				high_pfn = pfn;
>> +		}
>> +	}
>> +	return high_pfn;
>> +}
>> +
>> +static void sort_free_pages(struct list_head *src, struct free_list *dst)
>> +{
>> +	unsigned int order;
>> +	struct page *page, *next;
>> +
>> +	list_for_each_entry_safe(page, next, src, lru) {
>> +		order = buddy_order(page);
>
> These pages are already isolated from the buddy system, but continue to use buddy_order() to get the page order, which can be confusing. Moreover, buddy_order() should only be used under zone lock protection.
>
> IMO, just use 'page_private()' to get the page order like split_map_pages() already does, which seems more readable.

Sure. Will do in the next version. Thanks.

>
>> +
>> +		list_move(&page->lru, &dst[order].pages);
>> +		dst[order].nr_free++;
>> +	}
>> +}
>> +
>>   #ifdef CONFIG_COMPACTION
>>   bool PageMovable(struct page *page)
>>   {
>> @@ -1422,6 +1460,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>>   {
>>   	unsigned long start_pfn, end_pfn;
>>   	struct page *page;
>> +	LIST_HEAD(freelist);
>>    	/* Do not search around if there are enough pages already */
>>   	if (cc->nr_freepages >= cc->nr_migratepages)
>> @@ -1439,7 +1478,8 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>>   	if (!page)
>>   		return;
>>  -	isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
>> +	isolate_freepages_block(cc, &start_pfn, end_pfn, &freelist, 1, false);
>> +	sort_free_pages(&freelist, cc->freepages);
>>    	/* Skip this pageblock in the future as it's full or nearly full */
>>   	if (start_pfn == end_pfn && !cc->no_set_skip_hint)
>> @@ -1568,7 +1608,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
>>   				nr_scanned += nr_isolated - 1;
>>   				total_isolated += nr_isolated;
>>   				cc->nr_freepages += nr_isolated;
>> -				list_add_tail(&page->lru, &cc->freepages);
>> +				list_add_tail(&page->lru, &cc->freepages[order].pages);
>
> Missed updating cc->freepages[order].nr_free?

Good catch. Thanks. Will fix it.

>
>>   				count_compact_events(COMPACTISOLATED, nr_isolated);
>>   			} else {
>>   				/* If isolation fails, abort the search */
>> @@ -1642,13 +1682,13 @@ static void isolate_freepages(struct compact_control *cc)
>>   	unsigned long isolate_start_pfn; /* exact pfn we start at */
>>   	unsigned long block_end_pfn;	/* end of current pageblock */
>>   	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
>> -	struct list_head *freelist = &cc->freepages;
>>   	unsigned int stride;
>> +	LIST_HEAD(freelist);
>>    	/* Try a small search of the free lists for a candidate */
>>   	fast_isolate_freepages(cc);
>>   	if (cc->nr_freepages)
>> -		goto splitmap;
>> +		return;
>>    	/*
>>   	 * Initialise the free scanner. The starting point is where we last
>> @@ -1708,7 +1748,8 @@ static void isolate_freepages(struct compact_control *cc)
>>    		/* Found a block suitable for isolating free pages from. */
>>   		nr_isolated = isolate_freepages_block(cc, &isolate_start_pfn,
>> -					block_end_pfn, freelist, stride, false);
>> +					block_end_pfn, &freelist, stride, false);
>> +		sort_free_pages(&freelist, cc->freepages);
>>    		/* Update the skip hint if the full pageblock was scanned */
>>   		if (isolate_start_pfn == block_end_pfn)
>> @@ -1749,10 +1790,6 @@ static void isolate_freepages(struct compact_control *cc)
>>   	 * and the loop terminated due to isolate_start_pfn < low_pfn
>>   	 */
>>   	cc->free_pfn = isolate_start_pfn;
>> -
>> -splitmap:
>> -	/* __isolate_free_page() does not map the pages */
>> -	split_map_pages(freelist);
>>   }
>>    /*
>> @@ -1763,18 +1800,21 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>>   {
>>   	struct compact_control *cc = (struct compact_control *)data;
>>   	struct folio *dst;
>> +	int order = folio_order(src);
>>  -	if (list_empty(&cc->freepages)) {
>> +	if (!cc->freepages[order].nr_free) {
>>   		isolate_freepages(cc);
>> -
>> -		if (list_empty(&cc->freepages))
>> +		if (!cc->freepages[order].nr_free)
>>   			return NULL;
>>   	}
>>  -	dst = list_entry(cc->freepages.next, struct folio, lru);
>> +	dst = list_first_entry(&cc->freepages[order].pages, struct folio, lru);
>> +	cc->freepages[order].nr_free--;
>>   	list_del(&dst->lru);
>> -	cc->nr_freepages--;
>> -
>> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
>> +	if (order)
>> +		prep_compound_page(&dst->page, order);
>> +	cc->nr_freepages -= 1 << order;
>>   	return dst;
>>   }
>>  @@ -1786,9 +1826,34 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>>   static void compaction_free(struct folio *dst, unsigned long data)
>>   {
>>   	struct compact_control *cc = (struct compact_control *)data;
>> +	int order = folio_order(dst);
>> +	struct page *page = &dst->page;
>>  -	list_add(&dst->lru, &cc->freepages);
>> -	cc->nr_freepages++;
>> +	if (order) {
>> +		int i;
>> +
>> +		page[1].flags &= ~PAGE_FLAGS_SECOND;
>> +		for (i = 1; i < (1 << order); i++) {
>> +			page[i].mapping = NULL;
>> +			clear_compound_head(&page[i]);
>> +			page[i].flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>> +		}
>> +
>> +	}
>> +	/* revert post_alloc_hook() operations */
>> +	page->mapping = NULL;
>> +	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>> +	set_page_count(page, 0);
>> +	page_mapcount_reset(page);
>> +	reset_page_owner(page, order);
>> +	page_table_check_free(page, order);
>> +	arch_free_page(page, order);
>> +	set_page_private(page, order);
>> +	INIT_LIST_HEAD(&dst->lru);
>> +
>> +	list_add(&dst->lru, &cc->freepages[order].pages);
>> +	cc->freepages[order].nr_free++;
>> +	cc->nr_freepages += 1 << order;
>>   }
>>    /* possible outcome of isolate_migratepages */
>> @@ -2412,6 +2477,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>   	const bool sync = cc->mode != MIGRATE_ASYNC;
>>   	bool update_cached;
>>   	unsigned int nr_succeeded = 0;
>> +	int order;
>>    	/*
>>   	 * These counters track activities during zone compaction.  Initialize
>> @@ -2421,7 +2487,10 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>   	cc->total_free_scanned = 0;
>>   	cc->nr_migratepages = 0;
>>   	cc->nr_freepages = 0;
>> -	INIT_LIST_HEAD(&cc->freepages);
>> +	for (order = 0; order <= MAX_ORDER; order++) {
>> +		INIT_LIST_HEAD(&cc->freepages[order].pages);
>> +		cc->freepages[order].nr_free = 0;
>> +	}
>>   	INIT_LIST_HEAD(&cc->migratepages);
>>    	cc->migratetype = gfp_migratetype(cc->gfp_mask);
>> @@ -2607,7 +2676,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>   	 * so we don't leave any returned pages behind in the next attempt.
>>   	 */
>>   	if (cc->nr_freepages > 0) {
>> -		unsigned long free_pfn = release_freepages(&cc->freepages);
>> +		unsigned long free_pfn = release_free_list(cc->freepages);
>>    		cc->nr_freepages = 0;
>>   		VM_BUG_ON(free_pfn == 0);
>> @@ -2626,7 +2695,6 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>    	trace_mm_compaction_end(cc, start_pfn, end_pfn, sync, ret);
>>  -	VM_BUG_ON(!list_empty(&cc->freepages));
>>   	VM_BUG_ON(!list_empty(&cc->migratepages));
>>    	return ret;
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 8c90e966e9f8..f5c691bb5c1c 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -465,6 +465,11 @@ int split_free_page(struct page *free_page,
>>   /*
>>    * in mm/compaction.c
>>    */
>> +
>> +struct free_list {
>> +	struct list_head	pages;
>> +	unsigned long		nr_free;
>> +};
>>   /*
>>    * compact_control is used to track pages being migrated and the free pages
>>    * they are being migrated to during memory compaction. The free_pfn starts
>> @@ -473,7 +478,7 @@ int split_free_page(struct page *free_page,
>>    * completes when free_pfn <= migrate_pfn
>>    */
>>   struct compact_control {
>> -	struct list_head freepages;	/* List of free pages to migrate to */
>> +	struct free_list freepages[MAX_ORDER + 1];	/* List of free pages to migrate to */
>>   	struct list_head migratepages;	/* List of pages being migrated */
>>   	unsigned int nr_freepages;	/* Number of isolated free pages */
>>   	unsigned int nr_migratepages;	/* Number of pages to migrate */


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.
  2023-09-15  9:41   ` Baolin Wang
@ 2023-09-18 17:17     ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-18 17:17 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Kemeng Shi,
	Mel Gorman, Rohan Puri, Mcgrof Chamberlain, Adam Manzanares,
	John Hubbard

[-- Attachment #1: Type: text/plain, Size: 2683 bytes --]

On 15 Sep 2023, at 5:41, Baolin Wang wrote:

> On 9/13/2023 12:28 AM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Since compaction code can compact >0 order folios, enable it during the
>> process.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/compaction.c | 25 ++++++++++---------------
>>   1 file changed, 10 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 4300d877b824..f72af74094de 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1087,11 +1087,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   		if (PageCompound(page) && !cc->alloc_contig) {
>>   			const unsigned int order = compound_order(page);
>>  -			if (likely(order <= MAX_ORDER)) {
>> -				low_pfn += (1UL << order) - 1;
>> -				nr_scanned += (1UL << order) - 1;
>> +			/*
>> +			 * Compacting > pageblock_order pages does not improve
>> +			 * memory fragmentation. Also skip hugetlbfs pages.
>> +			 */
>> +			if (likely(order >= pageblock_order) || PageHuge(page)) {
>
> IMO, if the compound page order is larger than the requested cc->order, we should also fail the isolation, because it also does not improve fragmentation, right?
>

Probably yes. I think the reasoning should be: since compaction is asking for cc->order,
we should not compact folios with orders larger than or equal to that, because
cc->order tells us the maximum free page order is smaller than it; otherwise the
allocation would have succeeded already. I will add this condition in the next version.
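
Something along these lines (untested, just to show the condition I have
in mind; cc->order is -1 for full zone compaction, hence the > 0 guard):

			if (order >= pageblock_order || PageHuge(page) ||
			    (cc->order > 0 && order >= cc->order)) {
				if (order <= MAX_ORDER) {
					low_pfn += (1UL << order) - 1;
					nr_scanned += (1UL << order) - 1;
				}
				goto isolate_fail;
			}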

>> +				if (order <= MAX_ORDER) {
>> +					low_pfn += (1UL << order) - 1;
>> +					nr_scanned += (1UL << order) - 1;
>> +				}
>> +				goto isolate_fail;
>>   			}
>> -			goto isolate_fail;
>>   		}
>>    		/*
>> @@ -1214,17 +1220,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   					goto isolate_abort;
>>   				}
>>   			}
>> -
>> -			/*
>> -			 * folio become large since the non-locked check,
>> -			 * and it's on LRU.
>> -			 */
>> -			if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
>> -				low_pfn += folio_nr_pages(folio) - 1;
>> -				nr_scanned += folio_nr_pages(folio) - 1;
>> -				folio_set_lru(folio);
>> -				goto isolate_fail_put;
>> -			}
>
> I do not think you can remove this validation, since the previous validation is lockless. So under the lock, we need to re-check whether the compound page order is larger than pageblock_order or cc->order, and fail the isolation if so.

This check should go away, but a new order check for large folios should be
added. Will add it. Thanks.
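
Something like the below is what I have in mind (hypothetical, mirroring
the removed lockless check but with the order test added):

			if (unlikely(folio_test_large(folio) && !cc->alloc_contig &&
				     (folio_order(folio) >= pageblock_order ||
				      (cc->order > 0 &&
				       folio_order(folio) >= cc->order)))) {
				low_pfn += folio_nr_pages(folio) - 1;
				nr_scanned += folio_nr_pages(folio) - 1;
				folio_set_lru(folio);
				goto isolate_fail_put;
			}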

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.
  2023-09-18  7:34   ` Baolin Wang
@ 2023-09-18 17:20     ` Zi Yan
  2023-09-20  8:15       ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2023-09-18 17:20 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Kemeng Shi,
	Mel Gorman, Rohan Puri, Mcgrof Chamberlain, Adam Manzanares,
	John Hubbard

[-- Attachment #1: Type: text/plain, Size: 3334 bytes --]

On 18 Sep 2023, at 3:34, Baolin Wang wrote:

> On 9/13/2023 12:28 AM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> During migration in a memory compaction, free pages are placed in an array
>> of page lists based on their order. But the desired free page order (i.e.,
>> the order of a source page) might not always be present, thus leading to
>> migration failures. Split a high order free page when the source migration
>> page has a lower order to increase the migration success rate.
>>
>> Note: merging free pages when a migration fails and a lower order free
>> page is returned via compaction_free() is possible, but there is too much
>> work. Since the free pages are not buddy pages, it is hard to identify
>> these free pages using existing PFN-based page merging algorithm.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 868e92e55d27..45747ab5f380 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>>   	struct compact_control *cc = (struct compact_control *)data;
>>   	struct folio *dst;
>>   	int order = folio_order(src);
>> +	bool has_isolated_pages = false;
>>  +again:
>>   	if (!cc->freepages[order].nr_free) {
>> -		isolate_freepages(cc);
>> +		int i;
>> +
>> +		for (i = order + 1; i <= MAX_ORDER; i++) {
>> +			if (cc->freepages[i].nr_free) {
>> +				struct page *freepage =
>> +					list_first_entry(&cc->freepages[i].pages,
>> +							 struct page, lru);
>> +
>> +				int start_order = i;
>> +				unsigned long size = 1 << start_order;
>> +
>> +				list_del(&freepage->lru);
>> +				cc->freepages[i].nr_free--;
>> +
>> +				while (start_order > order) {
>> +					start_order--;
>> +					size >>= 1;
>> +
>> +					list_add(&freepage[size].lru,
>> +						&cc->freepages[start_order].pages);
>> +					cc->freepages[start_order].nr_free++;
>> +					set_page_private(&freepage[size], start_order);
>
> IIUC, these split pages should also call functions to initialize? e.g. prep_compound_page()?

Not at this place. It is done right below, just above the "done" label. While free pages
are on cc->freepages, we want to keep them unprocessed by post_alloc_hook() or
prep_compound_page() so that a future split remains possible. A free page is
only initialized when it is returned by compaction_alloc().

>
>> +				}
>> +				post_alloc_hook(freepage, order, __GFP_MOVABLE);
>> +				if (order)
>> +					prep_compound_page(freepage, order);
>> +				dst = page_folio(freepage);
>> +				goto done;
>> +			}
>> +		}
>> +		if (!has_isolated_pages) {
>> +			isolate_freepages(cc);
>> +			has_isolated_pages = true;
>> +			goto again;
>> +		}
>> +
>>   		if (!cc->freepages[order].nr_free)
>>   			return NULL;
>>   	}
>> @@ -1814,6 +1851,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>>   	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
>>   	if (order)
>>   		prep_compound_page(&dst->page, order);
>> +done:
>>   	cc->nr_freepages -= 1 << order;
>>   	return dst;
>>   }


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.
  2023-09-18 17:20     ` Zi Yan
@ 2023-09-20  8:15       ` Baolin Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Baolin Wang @ 2023-09-20  8:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard



On 9/19/2023 1:20 AM, Zi Yan wrote:
> On 18 Sep 2023, at 3:34, Baolin Wang wrote:
> 
>> On 9/13/2023 12:28 AM, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> During migration in a memory compaction, free pages are placed in an array
>>> of page lists based on their order. But the desired free page order (i.e.,
>>> the order of a source page) might not always be present, thus leading to
>>> migration failures. Split a high order free page when the source migration
>>> page has a lower order to increase the migration success rate.
>>>
>>> Note: merging free pages when a migration fails and a lower order free
>>> page is returned via compaction_free() is possible, but there is too much
>>> work. Since the free pages are not buddy pages, it is hard to identify
>>> these free pages using existing PFN-based page merging algorithm.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 39 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index 868e92e55d27..45747ab5f380 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>>>    	struct compact_control *cc = (struct compact_control *)data;
>>>    	struct folio *dst;
>>>    	int order = folio_order(src);
>>> +	bool has_isolated_pages = false;
>>>   +again:
>>>    	if (!cc->freepages[order].nr_free) {
>>> -		isolate_freepages(cc);
>>> +		int i;
>>> +
>>> +		for (i = order + 1; i <= MAX_ORDER; i++) {
>>> +			if (cc->freepages[i].nr_free) {
>>> +				struct page *freepage =
>>> +					list_first_entry(&cc->freepages[i].pages,
>>> +							 struct page, lru);
>>> +
>>> +				int start_order = i;
>>> +				unsigned long size = 1 << start_order;
>>> +
>>> +				list_del(&freepage->lru);
>>> +				cc->freepages[i].nr_free--;
>>> +
>>> +				while (start_order > order) {
>>> +					start_order--;
>>> +					size >>= 1;
>>> +
>>> +					list_add(&freepage[size].lru,
>>> +						&cc->freepages[start_order].pages);
>>> +					cc->freepages[start_order].nr_free++;
>>> +					set_page_private(&freepage[size], start_order);
>>
>> IIUC, these split pages should also call functions to initialize? e.g. prep_compound_page()?
> 
> Not at this place. It is done right below and above "done" label. When free pages
> are on cc->freepages, we want to keep them without being post_alloc_hook() or
> prep_compound_page() processed for a possible future split. A free page is
> only initialized when it is returned by compaction_alloc().

Ah, I see. Thanks for explanation.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.
  2023-09-12 16:28 ` [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios Zi Yan
  2023-09-15  9:41   ` Baolin Wang
@ 2023-09-20 14:44   ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2023-09-20 14:44 UTC (permalink / raw)
  To: Zi Yan
  Cc: oe-lkp, lkp, linux-mm, linux-kernel, Zi Yan, Ryan Roberts,
	Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard, oliver.sang



Hello,

kernel test robot noticed "kernel_BUG_at_lib/list_debug.c" on:

commit: 810d9ce367799ba4fef1e894b342e5ab74d44681 ("[RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.")
url: https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-compaction-add-support-for-0-order-folio-memory-compaction/20230913-003027
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20230912162815.440749-5-zi.yan@sent.com/
patch subject: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.

in testcase: vm-scalability
version: vm-scalability-x86_64-1.0-0_20220518
with following parameters:

	runtime: 300s
	test: lru-file-readtwice
	cpufreq_governor: performance

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/


compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480L (Sapphire Rapids) with 512G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309202236.2848083f-oliver.sang@intel.com


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230920/202309202236.2848083f-oliver.sang@intel.com


[  104.256019][ T1493] list_del corruption, ffd40001e3611490->prev is NULL
[  104.264911][ T1493] ------------[ cut here ]------------
[  104.272315][ T1493] kernel BUG at lib/list_debug.c:54!
[  104.279501][ T1493] invalid opcode: 0000 [#1] SMP NOPTI
[  104.286658][ T1493] CPU: 91 PID: 1493 Comm: kcompactd1 Not tainted 6.6.0-rc1-00153-g810d9ce36779 #1
[  104.298169][ T1493] Hardware name: NULL NULL/NULL, BIOS 05.02.01 05/12/2023
[  104.307252][ T1493] RIP: 0010:__list_del_entry_valid_or_report+0x6e/0xf0
[  104.315987][ T1493] Code: b8 01 00 00 00 c3 cc cc cc cc 48 89 fe 48 c7 c7 80 c1 71 82 e8 e3 37 a3 ff 0f 0b 48 89 fe 48 c7 c7 b0 c1 71 82 e8 d2 37 a3 ff <0f> 0b 48 89 fe 48 c7 c7 e0 c1 71 82 e8 c1 37 a3 ff 0f 0b 48 89 fe
[  104.339068][ T1493] RSP: 0018:ffa0000010a37910 EFLAGS: 00010046
[  104.346919][ T1493] RAX: 0000000000000033 RBX: ff110080749b5ab8 RCX: 0000000000000000
[  104.356938][ T1493] RDX: 0000000000000000 RSI: ff11007f416dc6c0 RDI: ff11007f416dc6c0
[  104.366914][ T1493] RBP: ff110040b00af858 R08: 0000000000000000 R09: ffa0000010a377b8
[  104.376873][ T1493] R10: 0000000000000003 R11: ff11007f40dfffe8 R12: ffd40001e3611400
[  104.386808][ T1493] R13: 0000000000000000 R14: ffd40001e3611400 R15: ffa0000010a37938
[  104.396739][ T1493] FS:  0000000000000000(0000) GS:ff11007f416c0000(0000) knlGS:0000000000000000
[  104.407739][ T1493] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.416072][ T1493] CR2: 000055a550b5eb38 CR3: 0000008069078004 CR4: 0000000000f71ee0
[  104.425986][ T1493] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  104.435870][ T1493] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  104.445790][ T1493] PKRU: 55555554
[  104.450668][ T1493] Call Trace:
[  104.455221][ T1493]  <TASK>
[  104.459360][ T1493]  ? die+0x36/0xb0
[  104.464363][ T1493]  ? do_trap+0xda/0x130
[  104.469839][ T1493]  ? __list_del_entry_valid_or_report+0x6e/0xf0
[  104.477666][ T1493]  ? do_error_trap+0x65/0xb0
[  104.483614][ T1493]  ? __list_del_entry_valid_or_report+0x6e/0xf0
[  104.491418][ T1493]  ? exc_invalid_op+0x50/0x70
[  104.497453][ T1493]  ? __list_del_entry_valid_or_report+0x6e/0xf0
[  104.505246][ T1493]  ? asm_exc_invalid_op+0x1a/0x20
[  104.511655][ T1493]  ? __list_del_entry_valid_or_report+0x6e/0xf0
[  104.519423][ T1493]  split_huge_page_to_list+0x3ad/0x5b0
[  104.526306][ T1493]  migrate_pages_batch+0x1f6/0x970
[  104.532797][ T1493]  ? __pfx_compaction_alloc+0x10/0x10
[  104.539564][ T1493]  ? __pfx_compaction_free+0x10/0x10
[  104.546219][ T1493]  ? __pfx_compaction_alloc+0x10/0x10
[  104.552955][ T1493]  migrate_pages_sync+0x99/0x230
[  104.559201][ T1493]  ? __pfx_compaction_alloc+0x10/0x10
[  104.565917][ T1493]  ? __pfx_compaction_free+0x10/0x10
[  104.572522][ T1493]  migrate_pages+0x3d9/0x530
[  104.578341][ T1493]  ? __pfx_compaction_alloc+0x10/0x10
[  104.585033][ T1493]  ? __pfx_compaction_free+0x10/0x10
[  104.591617][ T1493]  compact_zone+0x286/0xa30
[  104.597313][ T1493]  kcompactd_do_work+0x103/0x2f0
[  104.603487][ T1493]  kcompactd+0x238/0x430
[  104.608873][ T1493]  ? __pfx_autoremove_wake_function+0x10/0x10
[  104.616315][ T1493]  ? __pfx_kcompactd+0x10/0x10
[  104.622284][ T1493]  kthread+0xcd/0x130
[  104.627371][ T1493]  ? __pfx_kthread+0x10/0x10
[  104.633117][ T1493]  ret_from_fork+0x31/0x70
[  104.638664][ T1493]  ? __pfx_kthread+0x10/0x10
[  104.644390][ T1493]  ret_from_fork_asm+0x1b/0x30
[  104.650309][ T1493]  </TASK>
[  104.654264][ T1493] Modules linked in: xfs loop btrfs intel_rapl_msr blake2b_generic intel_rapl_common xor ses x86_pkg_temp_thermal enclosure raid6_pq sd_mod scsi_transport_sas intel_powerclamp libcrc32c sg coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sha512_ssse3 ahci nvme_core rapl ast ipmi_ssif libahci t10_pi mei_me intel_cstate drm_shmem_helper crc64_rocksoft_generic i2c_i801 crc64_rocksoft acpi_ipmi drm_kms_helper megaraid_sas joydev dax_hmem intel_uncore libata mei i2c_ismt crc64 i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter drm fuse ip_tables
[  104.717626][ T1493] ---[ end trace 0000000000000000 ]---
[  104.807226][ T1493] RIP: 0010:__list_del_entry_valid_or_report+0x6e/0xf0
[  104.815628][ T1493] Code: b8 01 00 00 00 c3 cc cc cc cc 48 89 fe 48 c7 c7 80 c1 71 82 e8 e3 37 a3 ff 0f 0b 48 89 fe 48 c7 c7 b0 c1 71 82 e8 d2 37 a3 ff <0f> 0b 48 89 fe 48 c7 c7 e0 c1 71 82 e8 c1 37 a3 ff 0f 0b 48 89 fe
[  104.838334][ T1493] RSP: 0018:ffa0000010a37910 EFLAGS: 00010046
[  104.845773][ T1493] RAX: 0000000000000033 RBX: ff110080749b5ab8 RCX: 0000000000000000
[  104.855343][ T1493] RDX: 0000000000000000 RSI: ff11007f416dc6c0 RDI: ff11007f416dc6c0
[  104.864898][ T1493] RBP: ff110040b00af858 R08: 0000000000000000 R09: ffa0000010a377b8
[  104.874452][ T1493] R10: 0000000000000003 R11: ff11007f40dfffe8 R12: ffd40001e3611400
[  104.884001][ T1493] R13: 0000000000000000 R14: ffd40001e3611400 R15: ffa0000010a37938
[  104.893543][ T1493] FS:  0000000000000000(0000) GS:ff11007f416c0000(0000) knlGS:0000000000000000
[  104.904152][ T1493] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.912119][ T1493] CR2: 000055a550b5eb38 CR3: 0000008069078004 CR4: 0000000000f71ee0
[  104.921634][ T1493] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  104.931149][ T1493] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  104.940655][ T1493] PKRU: 55555554
[  104.945174][ T1493] Kernel panic - not syncing: Fatal exception
[  105.991260][ T1493] Shutting down cpus with NMI
[  106.046902][ T1493] Kernel Offset: disabled

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
                   ` (3 preceding siblings ...)
  2023-09-12 16:28 ` [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios Zi Yan
@ 2023-09-21  0:55 ` Luis Chamberlain
  2023-09-21  1:16   ` Luis Chamberlain
  2023-10-02 12:32 ` Ryan Roberts
  2023-10-09  7:12 ` Huang, Ying
  6 siblings, 1 reply; 32+ messages in thread
From: Luis Chamberlain @ 2023-09-21  0:55 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Adam Manzanares, John Hubbard

On Tue, Sep 12, 2023 at 12:28:11PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Feel free to give comments and ask questions.

How about testing? I'm looking with an eye towards creating a
pathological situation that can be automated for fragmentation, to see
how things go.

Mel Gorman's original artificial fragmentation setup, taken from his first
patches to help with fragmentation avoidance from 2018, suggests he
tried [0]:

------ From 2018
a) Create an XFS filesystem

b) Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate is
not used (fallocate=none). With multiple IO issuers this creates a mix
of slab and page cache allocations over time. The total size of the
files is 150% physical memory so that the slabs and page cache pages get
mixed

c) Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it took to
create the files. It'll fault back in old data and further interleave
slab and page cache allocations. As it's now low on memory due to step
2, fragmentation occurs as pageblocks get stolen. While step 3 is still
running, start a process that tries to allocate 75% of memory as huge
pages with a number of threads. The number of threads is based on a
(NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with
fio, any other threads or forcing cross-NUMA scheduling. Note that the
test has not been used on a machine with less than 8 cores. The
benchmark records whether huge pages were allocated and what the fault
latency was in microseconds

d) Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
------- end of extract

These days we can probably do a bit more damage. There have been concerns
that LBS support (block size > page size) could make fragmentation worse; one of
the reasons is that any file created, regardless of its size, will require at
least the block size, and if using a 64k block size that means a 64k
allocation for each new file on that 64k block size filesystem, so
clearly you may run out of lower order allocations pretty quickly. You
can also create different large block filesystems too, one for 64k and
another for 32k. Although LBS is new and we're still ironing out the
kinks, if you want to give it a go we've rebased the patches onto Linus'
tree [1], and if you want to ramp up fast you could use kdevops [2], which
lets you pick that branch and also a series of NVMe drives (by enabling
CONFIG_LIBVIRT_EXTRA_STORAGE_DRIVE_NVME) for large IO experimentation (by
enabling CONFIG_VAGRANT_ENABLE_LARGEIO). Creating different filesystems
with large block sizes (64k, 32k, 16k) on a 4k sector size drive
(mkfs.xfs -f -b size=64k -s size=4k) should let you easily do tons of
crazy pathological things.

Are there other known recipes to help test this stuff?
How exactly do we measure success for fragmentation in your patches?

[0] https://lwn.net/Articles/770235/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev
[2] https://github.com/linux-kdevops/kdevops

  Luis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-21  0:55 ` [RFC PATCH 0/4] Enable >0 order folio memory compaction Luis Chamberlain
@ 2023-09-21  1:16   ` Luis Chamberlain
  2023-09-21  2:05     ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Luis Chamberlain @ 2023-09-21  1:16 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Adam Manzanares, John Hubbard

On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> Are there other known recipes test help test this stuff?

You know, it got me wondering... how fragmented a system's memory
might get just from running fstests, because, well, we already have
that automated in kdevops, and it also has LBS support for all the
different large block sizes on a 4k sector size. So if we just had a
way to "measure" or "quantify" memory fragmentation with a score,
we could just tally up how we did after 4 hours of testing for each
block size, with a given amount of memory on the guest / target node / cloud
system.

  Luis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-21  1:16   ` Luis Chamberlain
@ 2023-09-21  2:05     ` John Hubbard
  2023-09-21  3:14       ` Luis Chamberlain
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2023-09-21  2:05 UTC (permalink / raw)
  To: Luis Chamberlain, Zi Yan
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Adam Manzanares

On 9/20/23 18:16, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>> Are there other known recipes test help test this stuff?
> 
> You know, it got me wondering... since how memory fragmented a system
> might be by just running fstests, because, well, we already have
> that automated in kdevops and it also has LBS support for all the
> different large block sizes on 4k sector size. So if we just had a
> way to "measure" or "quantify" memory fragmentation with a score,
> we could just tally up how we did after 4 hours of testing for each
> block size with a set of memory on the guest / target node / cloud
> system.
> 
>    Luis

I thought about it, and here is one possible way to quantify
fragmentation with just a single number. Take this with some
skepticism because it is a first draft sort of thing:

a) Let BLOCKS be the number of 4KB pages (or more generally, the number
of smallest-sized objects allowed) in the area.

b) Let FRAGS be the number of free *or* allocated chunks (no need to
consider the size of each, as that is automatically taken into
consideration).

Then:
       fragmentation percentage = (FRAGS / BLOCKS) * 100%

This has some nice properties. For one thing, it's easy to calculate.
For another, it can discern between these cases:

Assume a 12-page area:

Case 1) 6 pages allocated unevenly:

1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated

fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%

Case 2) 6 pages allocated evenly: every other page is allocated:

fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%



thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-21  2:05     ` John Hubbard
@ 2023-09-21  3:14       ` Luis Chamberlain
  2023-09-21 15:56         ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Luis Chamberlain @ 2023-09-21  3:14 UTC (permalink / raw)
  To: John Hubbard
  Cc: Zi Yan, linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Adam Manzanares

On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
> On 9/20/23 18:16, Luis Chamberlain wrote:
> > On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> > > Are there other known recipes test help test this stuff?
> > 
> > You know, it got me wondering... since how memory fragmented a system
> > might be by just running fstests, because, well, we already have
> > that automated in kdevops and it also has LBS support for all the
> > different large block sizes on 4k sector size. So if we just had a
> > way to "measure" or "quantify" memory fragmentation with a score,
> > we could just tally up how we did after 4 hours of testing for each
> > block size with a set of memory on the guest / target node / cloud
> > system.
> > 
> >    Luis
> 
> I thought about it, and here is one possible way to quantify
> fragmentation with just a single number. Take this with some
> skepticism because it is a first draft sort of thing:
> 
> a) Let BLOCKS be the number of 4KB pages (or more generally, then number
> of smallest sized objects allowed) in the area.
> 
> b) Let FRAGS be the number of free *or* allocated chunks (no need to
> consider the size of each, as that is automatically taken into
> consideration).
> 
> Then:
>       fragmentation percentage = (FRAGS / BLOCKS) * 100%
> 
> This has some nice properties. For one thing, it's easy to calculate.
> For another, it can discern between these cases:
> 
> Assume a 12-page area:
> 
> Case 1) 6 pages allocated allocated unevenly:
> 
> 1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated
> 
> fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%
> 
> Case 2) 6 pages allocated evenly: every other page is allocated:
> 
> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%

Thanks! Will try this!

BTW stress-ng might also be a nice way to do other pathological things here.

  Luis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-21  3:14       ` Luis Chamberlain
@ 2023-09-21 15:56         ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2023-09-21 15:56 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: John Hubbard, linux-mm, linux-kernel, Ryan Roberts,
	Andrew Morton, "Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Baolin Wang,
	Kemeng Shi, Mel Gorman, Rohan Puri, Adam Manzanares

[-- Attachment #1: Type: text/plain, Size: 2891 bytes --]


On 20 Sep 2023, at 23:14, Luis Chamberlain wrote:

> On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
>> On 9/20/23 18:16, Luis Chamberlain wrote:
>>> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>>>> Are there other known recipes test help test this stuff?
>>>
>>> You know, it got me wondering... since how memory fragmented a system
>>> might be by just running fstests, because, well, we already have
>>> that automated in kdevops and it also has LBS support for all the
>>> different large block sizes on 4k sector size. So if we just had a
>>> way to "measure" or "quantify" memory fragmentation with a score,
>>> we could just tally up how we did after 4 hours of testing for each
>>> block size with a set of memory on the guest / target node / cloud
>>> system.
>>>
>>>    Luis
>>
>> I thought about it, and here is one possible way to quantify
>> fragmentation with just a single number. Take this with some
>> skepticism because it is a first draft sort of thing:
>>
>> a) Let BLOCKS be the number of 4KB pages (or more generally, then number
>> of smallest sized objects allowed) in the area.
>>
>> b) Let FRAGS be the number of free *or* allocated chunks (no need to
>> consider the size of each, as that is automatically taken into
>> consideration).
>>
>> Then:
>>       fragmentation percentage = (FRAGS / BLOCKS) * 100%
>>
>> This has some nice properties. For one thing, it's easy to calculate.
>> For another, it can discern between these cases:
>>
>> Assume a 12-page area:
>>
>> Case 1) 6 pages allocated allocated unevenly:
>>
>> 1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated
>>
>> fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%
>>
>> Case 2) 6 pages allocated evenly: every other page is allocated:
>>
>> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%
>
> Thanks! Will try this!
>
> BTW stress-ng might also be a nice way to do other pathalogical things here.
>
>   Luis

Thanks. These are all good performance tests and a good fragmentation metric.
I would like to get it working properly first. As I mentioned in another email,
there will be tons of exploration to do to improve >0 order folio memory compaction,
taking into consideration:

1. the distribution of free pages,
2. the goal of compaction, e.g., to allocate a single order folio or reduce
the overall fragmentation level,
3. the runtime cost of compaction, and more.
My patchset aims to provide a reasonably working compaction functionality.


In terms of correctness testing, what I have done locally is to:

1. have an XFS partition,
2. create files with various sizes from 4KB to 2MB,
3. mmap each of these files so that each uses one folio of the file size,
4. get the physical addresses of these folios,
5. trigger global memory compaction via sysctl,
6. read the physical addresses of these folios again.
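
A rough sketch of how steps 4-6 can be scripted (not the exact tool I use;
it needs root because /proc/self/pagemap hides PFNs otherwise, and error
handling is trimmed):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static uint64_t vaddr_to_pfn(void *vaddr)
{
	uint64_t entry = 0;
	long psize = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	pread(fd, &entry, sizeof(entry),
	      (uintptr_t)vaddr / psize * sizeof(entry));
	close(fd);
	/* bit 63: page present, bits 0-54: PFN */
	return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
}

int main(int argc, char **argv)
{
	struct stat st;
	int fd = open(argv[1], O_RDONLY);
	int sysctl_fd;
	char *addr;

	fstat(fd, &st);
	addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	(void)*(volatile char *)addr;		/* fault the folio in */

	printf("before: pfn %#llx\n", (unsigned long long)vaddr_to_pfn(addr));

	/* step 5: trigger global compaction via the sysctl */
	sysctl_fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	write(sysctl_fd, "1", 1);
	close(sysctl_fd);

	printf("after:  pfn %#llx\n", (unsigned long long)vaddr_to_pfn(addr));
	return 0;
}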

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
                   ` (4 preceding siblings ...)
  2023-09-21  0:55 ` [RFC PATCH 0/4] Enable >0 order folio memory compaction Luis Chamberlain
@ 2023-10-02 12:32 ` Ryan Roberts
  2023-10-09 13:24   ` Zi Yan
  2023-10-09  7:12 ` Huang, Ying
  6 siblings, 1 reply; 32+ messages in thread
From: Ryan Roberts @ 2023-10-02 12:32 UTC (permalink / raw)
  To: Zi Yan, linux-mm, linux-kernel
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

Hi Zi,

On 12/09/2023 17:28, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisitions for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.

I've taken a quick look at these and realize I'm not well equipped to provide
much in the way of meaningful review comments; All I can say is thanks for
putting this together, and yes, I think it will become even more important for
my work on anonymous large folios.


> 
> Overview
> ===
> 
> To support >0 order folio compaction, the patchset changes how free pages used
> for migration are kept during compaction. Free pages used to be split into
> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
> page order stored in page->private is zeroed, and page reference is set to 1).
> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
> on their order without post allocation process. When migrate_pages() asks for
> a new page, one of the free pages, based on the requested page order, is
> then processed and given out.
> 
> 
> Optimizations
> ===
> 
> 1. Free page split is added to increase migration success rate in case
> a source page does not have a matched free page in the free page lists.
> Free page merge is possible but not implemented, since existing
> PFN-based buddy page merge algorithm requires the identification of
> buddy pages, but free pages kept for memory compaction cannot have
> PageBuddy set to avoid confusing other PFN scanners.
> 
> 2. Sort source pages in ascending order before migration is added to
> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.

Not knowing much about how compaction actually works, naively I would imagine
that if you are just trying to free up a known amount of contiguous physical
space, then working through the pages in PFN order is more likely to yield the
result quicker? Unless all of the pages in the set must be successfully migrated
in order to free up the required amount of space...

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
                   ` (5 preceding siblings ...)
  2023-10-02 12:32 ` Ryan Roberts
@ 2023-10-09  7:12 ` Huang, Ying
  2023-10-09 13:43   ` Zi Yan
  6 siblings, 1 reply; 32+ messages in thread
From: Huang, Ying @ 2023-10-09  7:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Zi Yan, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

Hi, Zi,

Thanks for your patch!

Zi Yan <zi.yan@sent.com> writes:

> From: Zi Yan <ziy@nvidia.com>
>
> Hi all,
>
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisitions for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.
>
> Overview
> ===
>
> To support >0 order folio compaction, the patchset changes how free pages used
> for migration are kept during compaction.

migrate_pages() can split a large folio on allocation failure.  So
the minimal implementation could be

- allow migrating large folios in compaction
- return -ENOMEM for order > 0 in compaction_alloc()

The performance may not be desirable.  But that may be a baseline for
further optimization.

And, if we can measure the performance for each step of optimization,
that will be even better.

> Free pages used to be split into
> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
> page order stored in page->private is zeroed, and page reference is set to 1).
> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
> on their order without post allocation process. When migrate_pages() asks for
> a new page, one of the free pages, based on the requested page order, is
> then processed and given out.
>
>
> Optimizations
> ===
>
> 1. Free page split is added to increase migration success rate in case
> a source page does not have a matched free page in the free page lists.
> Free page merge is possible but not implemented, since existing
> PFN-based buddy page merge algorithm requires the identification of
> buddy pages, but free pages kept for memory compaction cannot have
> PageBuddy set to avoid confusing other PFN scanners.
>
> 2. Sort source pages in ascending order before migration is added to

Trivial.

s/ascending/descending/

> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.
>
>
> TODOs
> ===
>
> 1. Refactor free page post allocation and free page preparation code so
> that compaction_alloc() and compaction_free() can call functions instead
> of hard coding.
>
> 2. One possible optimization is to allow migrate_pages() to continue
> even if get_new_folio() returns a NULL. In general, that means there is
> not enough memory. But in >0 order folio compaction case, that means
> there is no suitable free page at source page order. It might be better
> to skip that page and finish the rest of migration to achieve a better
> compaction result.

We can split the source folio if get_new_folio() returns NULL.  So, do
we really need this?

In general, we may reconsider all further optimizations given splitting
is available already.

> 3. Another possible optimization is to enable free page merge. It is
> possible that a to-be-migrated page causes free page split then fails to
> migrate eventually. We would lose a high order free page without free
> page merge function. But a way of identifying free pages for memory
> compaction is needed to reuse existing PFN-based buddy page merge.
>
> 4. The implemented >0 order folio compaction algorithm is quite naive
> and does not consider all possible situations. A better algorithm can
> improve compaction success rate.
>
>
> Feel free to give comments and ask questions.
>
> Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>
> Zi Yan (4):
>   mm/compaction: add support for >0 order folio memory compaction.
>   mm/compaction: optimize >0 order folio compaction with free page
>     split.
>   mm/compaction: optimize >0 order folio compaction by sorting source
>     pages.
>   mm/compaction: enable compacting >0 order folios.
>
>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>  mm/internal.h   |   7 +-
>  2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-02 12:32 ` Ryan Roberts
@ 2023-10-09 13:24   ` Zi Yan
  2023-10-09 14:10     ` Ryan Roberts
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2023-10-09 13:24 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-mm, linux-kernel, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Baolin Wang,
	Kemeng Shi, Mel Gorman, Rohan Puri, Mcgrof Chamberlain,
	Adam Manzanares, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 2852 bytes --]

On 2 Oct 2023, at 8:32, Ryan Roberts wrote:

> Hi Zi,
>
> On 12/09/2023 17:28, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset enables >0 order folio memory compaction, which is one of
>> the prerequisitions for large folio support[1]. It is on top of
>> mm-everything-2023-09-11-22-56.
>
> I've taken a quick look at these and realize I'm not well equipped to provide
> much in the way of meaningful review comments; All I can say is thanks for
> putting this together, and yes, I think it will become even more important for
> my work on anonymous large folios.
>
>
>>
>> Overview
>> ===
>>
>> To support >0 order folio compaction, the patchset changes how free pages used
>> for migration are kept during compaction. Free pages used to be split into
>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>> page order stored in page->private is zeroed, and page reference is set to 1).
>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>> on their order without post allocation process. When migrate_pages() asks for
>> a new page, one of the free pages, based on the requested page order, is
>> then processed and given out.
>>
>>
>> Optimizations
>> ===
>>
>> 1. Free page split is added to increase migration success rate in case
>> a source page does not have a matched free page in the free page lists.
>> Free page merge is possible but not implemented, since existing
>> PFN-based buddy page merge algorithm requires the identification of
>> buddy pages, but free pages kept for memory compaction cannot have
>> PageBuddy set to avoid confusing other PFN scanners.
>>
>> 2. Sort source pages in ascending order before migration is added to
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>
> Not knowing much about how compaction actually works, naively I would imagine
> that if you are just trying to free up a known amount of contiguous physical
> space, then working through the pages in PFN order is more likely to yield the
> result quicker? Unless all of the pages in the set must be successfully migrated
> in order to free up the required amount of space...

During compaction, pages are not freed, since that is the job of page reclaim.
The goal of compaction is to get a high order free page without freeing existing
pages to avoid potential high cost IO operations. If compaction does not work,
page reclaim would free pages to get us there (and potentially another follow-up
compaction). So either pages are migrated or stay where they are during compaction.

BTW compaction works by scanning in-use pages from lower PFN to higher PFN,
and free pages from higher PFN to lower PFN, until the two scanners meet in the middle.
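
The overall shape is roughly the following (heavily simplified; the real
compact_zone() has many more termination and deferral conditions):

	while (compact_finished(cc) == COMPACT_CONTINUE) {
		/* migration scanner: low PFN -> high PFN */
		isolate_migratepages(cc);
		/*
		 * compaction_alloc() pulls targets from the free scanner
		 * (isolate_freepages()), which walks high PFN -> low PFN;
		 * compaction completes once cc->free_pfn <= cc->migrate_pfn.
		 */
		migrate_pages(&cc->migratepages, compaction_alloc,
			      compaction_free, (unsigned long)cc,
			      cc->mode, MR_COMPACTION, &nr_succeeded);
	}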

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-09  7:12 ` Huang, Ying
@ 2023-10-09 13:43   ` Zi Yan
  2023-10-10  6:08     ` Huang, Ying
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2023-10-09 13:43 UTC (permalink / raw)
  To: "Huang, Ying"
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Baolin Wang,
	Kemeng Shi, Mel Gorman, Rohan Puri, Mcgrof Chamberlain,
	Adam Manzanares, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 5514 bytes --]

On 9 Oct 2023, at 3:12, Huang, Ying wrote:

> Hi, Zi,
>
> Thanks for your patch!
>
> Zi Yan <zi.yan@sent.com> writes:
>
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset enables >0 order folio memory compaction, which is one of
>> the prerequisitions for large folio support[1]. It is on top of
>> mm-everything-2023-09-11-22-56.
>>
>> Overview
>> ===
>>
>> To support >0 order folio compaction, the patchset changes how free pages used
>> for migration are kept during compaction.
>
> migrate_pages() can split the large folio for allocation failure.  So
> the minimal implementation could be
>
> - allow to migrate large folios in compaction
> - return -ENOMEM for order > 0 in compaction_alloc()
>
> The performance may be not desirable.  But that may be a baseline for
> further optimization.

I would imagine it might cause a regression since compaction might gradually
split high order folios in the system. But I can move Patch 4 first to make this
the baseline and see how system performance changes.

>
> And, if we can measure the performance for each step of optimization,
> that will be even better.

Do you have any benchmark in mind for the performance tests? vm-scalability?

>
>> Free pages used to be split into
>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>> page order stored in page->private is zeroed, and page reference is set to 1).
>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>> on their order without post allocation process. When migrate_pages() asks for
>> a new page, one of the free pages, based on the requested page order, is
>> then processed and given out.
>>
>>
>> Optimizations
>> ===
>>
>> 1. Free page split is added to increase migration success rate in case
>> a source page does not have a matched free page in the free page lists.
>> Free page merge is possible but not implemented, since existing
>> PFN-based buddy page merge algorithm requires the identification of
>> buddy pages, but free pages kept for memory compaction cannot have
>> PageBuddy set to avoid confusing other PFN scanners.
>>
>> 2. Sort source pages in ascending order before migration is added to
>
> Trivial.
>
> s/ascending/descending/
>
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>>
>>
>> TODOs
>> ===
>>
>> 1. Refactor free page post allocation and free page preparation code so
>> that compaction_alloc() and compaction_free() can call functions instead
>> of hard coding.
>>
>> 2. One possible optimization is to allow migrate_pages() to continue
>> even if get_new_folio() returns a NULL. In general, that means there is
>> not enough memory. But in >0 order folio compaction case, that means
>> there is no suitable free page at source page order. It might be better
>> to skip that page and finish the rest of migration to achieve a better
>> compaction result.
>
> We can split the source folio if get_new_folio() returns NULL.  So, do
> we really need this?

It depends. The situation where it can help is when the system is about
to allocate a high order page and triggers compaction: it might be possible to
get the high order free page by migrating a bunch of base pages instead of
splitting an existing high order folio.

>
> In general, we may reconsider all further optimizations given splitting
> is available already.

In my mind, splitting should be avoided as much as possible. But it really depends
on the actual situation, e.g., how much effort and cost compaction is willing
to pay to get memory defragmented. If the system really wants to get a high
order free page at any cost, splitting can be used without any issue. But applications
might lose performance because existing large folios are split just to form a
new one.

Like I said in the email, there are tons of optimizations and policies for us
to explore. We can start with the bare minimum support (if no performance
regression is observed, we can even start with splitting all high order folios like
you suggested) and add optimizations one by one.

>
>> 3. Another possible optimization is to enable free page merge. It is
>> possible that a to-be-migrated page causes free page split then fails to
>> migrate eventually. We would lose a high order free page without free
>> page merge function. But a way of identifying free pages for memory
>> compaction is needed to reuse existing PFN-based buddy page merge.
>>
>> 4. The implemented >0 order folio compaction algorithm is quite naive
>> and does not consider all possible situations. A better algorithm can
>> improve compaction success rate.
>>
>>
>> Feel free to give comments and ask questions.
>>
>> Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>
>> Zi Yan (4):
>>   mm/compaction: add support for >0 order folio memory compaction.
>>   mm/compaction: optimize >0 order folio compaction with free page
>>     split.
>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>     pages.
>>   mm/compaction: enable compacting >0 order folios.
>>
>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>  mm/internal.h   |   7 +-
>>  2 files changed, 176 insertions(+), 36 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-09 13:24   ` Zi Yan
@ 2023-10-09 14:10     ` Ryan Roberts
  2023-10-09 15:52       ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Ryan Roberts @ 2023-10-09 14:10 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

On 09/10/2023 14:24, Zi Yan wrote:
> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
> 
>> Hi Zi,
>>
>> On 12/09/2023 17:28, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisitions for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>
>> I've taken a quick look at these and realize I'm not well equipped to provide
>> much in the way of meaningful review comments; All I can say is thanks for
>> putting this together, and yes, I think it will become even more important for
>> my work on anonymous large folios.
>>
>>
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction. Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation process. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
>>>
>>>
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matched free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>
>> Not knowing much about how compaction actually works, naively I would imagine
>> that if you are just trying to free up a known amount of contiguous physical
>> space, then working through the pages in PFN order is more likely to yield the
>> result quicker? Unless all of the pages in the set must be successfully migrated
>> in order to free up the required amount of space...
> 
> During compaction, pages are not freed, since that is the job of page reclaim.

Sorry yes - my fault for using sloppy language. When I said "free up a known
amount of contiguous physical space", I really meant "move pages in order to
recover an amount of contiguous physical space". But I still think the rest of
what I said applies; wouldn't you be more likely to reach your goal quicker if
you sort by PFN?

> The goal of compaction is to get a high order free page without freeing existing
> pages to avoid potential high cost IO operations. If compaction does not work,
> page reclaim would free pages to get us there (and potentially another follow-up
> compaction). So either pages are migrated or stay where they are during compaction.
> 
> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-09 14:10     ` Ryan Roberts
@ 2023-10-09 15:52       ` Zi Yan
  2023-10-10 10:00         ` Ryan Roberts
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2023-10-09 15:52 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-mm, linux-kernel, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Baolin Wang,
	Kemeng Shi, Mel Gorman, Rohan Puri, Mcgrof Chamberlain,
	Adam Manzanares, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 4237 bytes --]

(resent as plain text)
On 9 Oct 2023, at 10:10, Ryan Roberts wrote:

> On 09/10/2023 14:24, Zi Yan wrote:
>> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>>
>>> Hi Zi,
>>>
>>> On 12/09/2023 17:28, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>> the prerequisitions for large folio support[1]. It is on top of
>>>> mm-everything-2023-09-11-22-56.
>>>
>>> I've taken a quick look at these and realize I'm not well equipped to provide
>>> much in the way of meaningful review comments; All I can say is thanks for
>>> putting this together, and yes, I think it will become even more important for
>>> my work on anonymous large folios.
>>>
>>>
>>>>
>>>> Overview
>>>> ===
>>>>
>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>> for migration are kept during compaction. Free pages used to be split into
>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>> on their order without post allocation process. When migrate_pages() asks for
>>>> a new page, one of the free pages, based on the requested page order, is
>>>> then processed and given out.
>>>>
>>>>
>>>> Optimizations
>>>> ===
>>>>
>>>> 1. Free page split is added to increase migration success rate in case
>>>> a source page does not have a matched free page in the free page lists.
>>>> Free page merge is possible but not implemented, since existing
>>>> PFN-based buddy page merge algorithm requires the identification of
>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>
>>>> 2. Sort source pages in ascending order before migration is added to
>>>> reduce free page split. Otherwise, high order free pages might be
>>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>> Not knowing much about how compaction actually works, naively I would imagine
>>> that if you are just trying to free up a known amount of contiguous physical
>>> space, then working through the pages in PFN order is more likely to yield the
>>> result quicker? Unless all of the pages in the set must be successfully migrated
>>> in order to free up the required amount of space...
>>
>> During compaction, pages are not freed, since that is the job of page reclaim.
>
> Sorry yes - my fault for using sloppy language. When I said "free up a known
> amount of contiguous physical space", I really meant "move pages in order to
> recover an amount of contiguous physical space". But I still think the rest of
> what I said applies; wouldn't you be more likely to reach your goal quicker if
> you sort by PFN?

Not always. Suppose the in-use folios on the left are order-2, order-2, order-4
(all contiguous in one pageblock) and the free pages on the right are order-4
(pageblock N) plus order-2, order-2 (pageblock N-1); they do not form a single
order-8 free page, since there are in-use folios in the middle. Going in PFN
order will not get you an order-8 free page, because the first order-4 free page
will be split into two order-2 pages for the first two order-2 in-use folios.
But if you migrate in descending order of in-use folio orders, you can get an
order-8 free page at the end.

The patchset minimizes free page splits to avoid the situation described above:
once a high order free page is split, the opportunity to migrate a high order
in-use folio into it is gone and is hard to recover.
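
To make it concrete, a sketch of the sorting idea (not the actual patch code;
it assumes the sort runs on cc->migratepages right before migrate_pages() and
uses list_sort() from <linux/list_sort.h>, and folio_order_cmp() is a made-up
name):

static int folio_order_cmp(void *priv, const struct list_head *a,
			   const struct list_head *b)
{
	struct folio *fa = list_entry(a, struct folio, lru);
	struct folio *fb = list_entry(b, struct folio, lru);

	/* list_sort() sorts ascending, so invert to get descending order */
	return (int)folio_order(fb) - (int)folio_order(fa);
}

	/* before migrate_pages(&cc->migratepages, ...) */
	list_sort(NULL, &cc->migratepages, folio_order_cmp);

That way the largest in-use folios get first pick of the largest isolated free
pages, and smaller folios only cause splits when nothing of a matching order
is left.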


>> The goal of compaction is to get a high order free page without freeing existing
>> pages to avoid potential high cost IO operations. If compaction does not work,
>> page reclaim would free pages to get us there (and potentially another follow-up
>> compaction). So either pages are migrated or stay where they are during compaction.
>>
>> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
>> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>>
>> --
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-09 13:43   ` Zi Yan
@ 2023-10-10  6:08     ` Huang, Ying
  2023-10-10 16:48       ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Ying @ 2023-10-10  6:08 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard


Something went wrong with my mailbox.  Sorry if you received duplicated
mail.

Zi Yan <ziy@nvidia.com> writes:

> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>
>> Hi, Zi,
>>
>> Thanks for your patch!
>>
>> Zi Yan <zi.yan@sent.com> writes:
>>
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisitions for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction.
>>
>> migrate_pages() can split the large folio for allocation failure.  So
>> the minimal implementation could be
>>
>> - allow to migrate large folios in compaction
>> - return -ENOMEM for order > 0 in compaction_alloc()
>>
>> The performance may not be desirable.  But that may be a baseline for
>> further optimization.
>
> I would imagine it might cause a regression since compaction might gradually
> split high order folios in the system.

I would not call it a pure regression, since large folios can be migrated
during compaction with that, but it's possible that this hurts
performance.

Anyway, this can be a not-so-good minimal baseline.
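
For illustration, a rough (untested) sketch of that baseline against the
unmodified compaction_alloc(), assuming the series still allows large folios
to be isolated for migration: returning NULL here is what migrate_pages()
treats as the -ENOMEM case, so it falls back to splitting the large source
folio.

static struct folio *compaction_alloc(struct folio *src, unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;
	struct folio *dst;

	/*
	 * Minimal baseline: no >0 order free pages are kept, so fail any
	 * large request and let migrate_pages() split the source folio.
	 */
	if (folio_order(src))
		return NULL;

	if (list_empty(&cc->freepages)) {
		isolate_freepages(cc);

		if (list_empty(&cc->freepages))
			return NULL;
	}

	dst = list_entry(cc->freepages.next, struct folio, lru);
	list_del(&dst->lru);
	cc->nr_freepages--;

	return dst;
}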

> But I can move Patch 4 first to make this the baseline and see how
> system performance changes.

Thanks!

>>
>> And, if we can measure the performance for each step of optimization,
>> that will be even better.
>
> Do you have any benchmark in mind for the performance tests? vm-scalability?

I remember Mel Gorman has done some tests for defragmentation before.
But that's for order-0 pages.

>>> Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation process. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
>>>
>>>
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matched free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>
>> Trivial.
>>
>> s/ascending/descending/
>>
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>>
>>> TODOs
>>> ===
>>>
>>> 1. Refactor free page post allocation and free page preparation code so
>>> that compaction_alloc() and compaction_free() can call functions instead
>>> of hard coding.
>>>
>>> 2. One possible optimization is to allow migrate_pages() to continue
>>> even if get_new_folio() returns a NULL. In general, that means there is
>>> not enough memory. But in >0 order folio compaction case, that means
>>> there is no suitable free page at source page order. It might be better
>>> to skip that page and finish the rest of migration to achieve a better
>>> compaction result.
>>
>> We can split the source folio if get_new_folio() returns NULL.  So, do
>> we really need this?
>
>> It depends. The situation it can benefit is when the system tries to
>> allocate a high order free page and triggers a compaction: it is possible to
>> get the high order free page by migrating a bunch of base pages instead of
>> splitting an existing high order folio.
>
>>
>> In general, we may reconsider all further optimizations given splitting
>> is available already.
>
> In my mind, split should be avoided as much as possible.

If so, should we use "nosplit" logic in migrate_pages_batch() in some
situations?

> But it really depends
> on the actual situation, e.g., how much effort and cost the compaction wants
> to pay to get memory defragmented. If the system really wants to get a high
> order free page at any cost, split can be used without any issue. But applications
>> might lose performance because existing large folios are split just to
>> form a new one.

Is it possible that splitting is desirable in some situations?  For
example, allocating some large DMA buffers at the cost of large anonymous
folios?

> Like I said in the email, there are tons of optimizations and policies for us
> to explore. We can start with the bare minimum support (if no performance
>> regression is observed, we can even start with splitting all high order
>> folios like you suggested) and add optimizations one by one.

Sounds good to me!  Thanks!

>>
>>> 3. Another possible optimization is to enable free page merge. It is
>>> possible that a to-be-migrated page causes free page split then fails to
>>> migrate eventually. We would lose a high order free page without free
>>> page merge function. But a way of identifying free pages for memory
>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>
>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>> and does not consider all possible situations. A better algorithm can
>>> improve compaction success rate.
>>>
>>>
>>> Feel free to give comments and ask questions.
>>>
>>> Thanks.
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>>
>>> Zi Yan (4):
>>>   mm/compaction: add support for >0 order folio memory compaction.
>>>   mm/compaction: optimize >0 order folio compaction with free page
>>>     split.
>>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>>     pages.
>>>   mm/compaction: enable compacting >0 order folios.
>>>
>>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>>  mm/internal.h   |   7 +-
>>>  2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/4] mm/compaction: add support for >0 order folio memory compaction.
  2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
  2023-09-12 17:32   ` Johannes Weiner
  2023-09-15  9:33   ` Baolin Wang
@ 2023-10-10  8:07   ` Huang, Ying
  2 siblings, 0 replies; 32+ messages in thread
From: Huang, Ying @ 2023-10-10  8:07 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Zi Yan, Ryan Roberts, Andrew Morton,
	Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

Zi Yan <zi.yan@sent.com> writes:

> From: Zi Yan <ziy@nvidia.com>
>
> Before, memory compaction only migrates order-0 folios and skips >0 order
> folios. This commit adds support for >0 order folio compaction by keeping
> isolated free pages at their original size without splitting them into
> order-0 pages and using them directly during the migration process.
>
> What is different from the prior implementation:
> 1. All isolated free pages are kept in a MAX_ORDER+1 array of page lists,
>    where each page list stores free pages in the same order.
> 2. Free pages are neither post_alloc_hook() processed nor buddy pages,
>    although their orders are stored in the first page's private field like
>    buddy pages.
> 3. During migration, at new page allocation time (i.e., in
>    compaction_alloc()), free pages are then processed by post_alloc_hook().
>    When migration fails and a new page is returned (i.e., in
>    compaction_free()), free pages are restored by reversing the
>    post_alloc_hook() operations.
>
> Step 3 is done for a later optimization, so that splitting and/or merging
> free pages during compaction becomes easier.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/compaction.c | 108 +++++++++++++++++++++++++++++++++++++++---------
>  mm/internal.h   |   7 +++-
>  2 files changed, 94 insertions(+), 21 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 01ba298739dd..868e92e55d27 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -107,6 +107,44 @@ static void split_map_pages(struct list_head *list)
>  	list_splice(&tmp_list, list);
>  }
>  
> +static unsigned long release_free_list(struct free_list *freepages)
> +{
> +	int order;
> +	unsigned long high_pfn = 0;
> +
> +	for (order = 0; order <= MAX_ORDER; order++) {
> +		struct page *page, *next;
> +
> +		list_for_each_entry_safe(page, next, &freepages[order].pages, lru) {
> +			unsigned long pfn = page_to_pfn(page);
> +
> +			list_del(&page->lru);
> +			/*
> +			 * Convert free pages into post allocation pages, so
> +			 * that we can free them via __free_page.
> +			 */
> +			post_alloc_hook(page, order, __GFP_MOVABLE);
> +			__free_pages(page, order);
> +			if (pfn > high_pfn)
> +				high_pfn = pfn;
> +		}
> +	}
> +	return high_pfn;
> +}
> +
> +static void sort_free_pages(struct list_head *src, struct free_list *dst)
> +{
> +	unsigned int order;
> +	struct page *page, *next;
> +
> +	list_for_each_entry_safe(page, next, src, lru) {
> +		order = buddy_order(page);
> +
> +		list_move(&page->lru, &dst[order].pages);
> +		dst[order].nr_free++;
> +	}
> +}
> +
>  #ifdef CONFIG_COMPACTION
>  bool PageMovable(struct page *page)
>  {
> @@ -1422,6 +1460,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>  {
>  	unsigned long start_pfn, end_pfn;
>  	struct page *page;
> +	LIST_HEAD(freelist);
>  
>  	/* Do not search around if there are enough pages already */
>  	if (cc->nr_freepages >= cc->nr_migratepages)
> @@ -1439,7 +1478,8 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
>  	if (!page)
>  		return;
>  
> -	isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
> +	isolate_freepages_block(cc, &start_pfn, end_pfn, &freelist, 1, false);
> +	sort_free_pages(&freelist, cc->freepages);
>  
>  	/* Skip this pageblock in the future as it's full or nearly full */
>  	if (start_pfn == end_pfn && !cc->no_set_skip_hint)
> @@ -1568,7 +1608,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
>  				nr_scanned += nr_isolated - 1;
>  				total_isolated += nr_isolated;
>  				cc->nr_freepages += nr_isolated;
> -				list_add_tail(&page->lru, &cc->freepages);
> +				list_add_tail(&page->lru, &cc->freepages[order].pages);
>  				count_compact_events(COMPACTISOLATED, nr_isolated);
>  			} else {
>  				/* If isolation fails, abort the search */
> @@ -1642,13 +1682,13 @@ static void isolate_freepages(struct compact_control *cc)
>  	unsigned long isolate_start_pfn; /* exact pfn we start at */
>  	unsigned long block_end_pfn;	/* end of current pageblock */
>  	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
> -	struct list_head *freelist = &cc->freepages;
>  	unsigned int stride;
> +	LIST_HEAD(freelist);
>  
>  	/* Try a small search of the free lists for a candidate */
>  	fast_isolate_freepages(cc);
>  	if (cc->nr_freepages)
> -		goto splitmap;
> +		return;
>  
>  	/*
>  	 * Initialise the free scanner. The starting point is where we last
> @@ -1708,7 +1748,8 @@ static void isolate_freepages(struct compact_control *cc)
>  
>  		/* Found a block suitable for isolating free pages from. */
>  		nr_isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> -					block_end_pfn, freelist, stride, false);
> +					block_end_pfn, &freelist, stride, false);
> +		sort_free_pages(&freelist, cc->freepages);
>  
>  		/* Update the skip hint if the full pageblock was scanned */
>  		if (isolate_start_pfn == block_end_pfn)
> @@ -1749,10 +1790,6 @@ static void isolate_freepages(struct compact_control *cc)
>  	 * and the loop terminated due to isolate_start_pfn < low_pfn
>  	 */
>  	cc->free_pfn = isolate_start_pfn;
> -
> -splitmap:
> -	/* __isolate_free_page() does not map the pages */
> -	split_map_pages(freelist);
>  }
>  
>  /*
> @@ -1763,18 +1800,21 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>  {
>  	struct compact_control *cc = (struct compact_control *)data;
>  	struct folio *dst;
> +	int order = folio_order(src);
>  
> -	if (list_empty(&cc->freepages)) {
> +	if (!cc->freepages[order].nr_free) {
>  		isolate_freepages(cc);
> -
> -		if (list_empty(&cc->freepages))
> +		if (!cc->freepages[order].nr_free)
>  			return NULL;
>  	}
>  
> -	dst = list_entry(cc->freepages.next, struct folio, lru);
> +	dst = list_first_entry(&cc->freepages[order].pages, struct folio, lru);
> +	cc->freepages[order].nr_free--;
>  	list_del(&dst->lru);
> -	cc->nr_freepages--;
> -
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> +	if (order)
> +		prep_compound_page(&dst->page, order);
> +	cc->nr_freepages -= 1 << order;
>  	return dst;
>  }
>  
> @@ -1786,9 +1826,34 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>  static void compaction_free(struct folio *dst, unsigned long data)
>  {
>  	struct compact_control *cc = (struct compact_control *)data;
> +	int order = folio_order(dst);
> +	struct page *page = &dst->page;
>  
> -	list_add(&dst->lru, &cc->freepages);
> -	cc->nr_freepages++;
> +	if (order) {
> +		int i;
> +
> +		page[1].flags &= ~PAGE_FLAGS_SECOND;
> +		for (i = 1; i < (1 << order); i++) {
> +			page[i].mapping = NULL;
> +			clear_compound_head(&page[i]);
> +			page[i].flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		}
> +
> +	}
> +	/* revert post_alloc_hook() operations */
> +	page->mapping = NULL;
> +	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +	set_page_count(page, 0);
> +	page_mapcount_reset(page);
> +	reset_page_owner(page, order);
> +	page_table_check_free(page, order);
> +	arch_free_page(page, order);
> +	set_page_private(page, order);
> +	INIT_LIST_HEAD(&dst->lru);
> +
> +	list_add(&dst->lru, &cc->freepages[order].pages);
> +	cc->freepages[order].nr_free++;
> +	cc->nr_freepages += 1 << order;
>  }
>  
>  /* possible outcome of isolate_migratepages */
> @@ -2412,6 +2477,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  	const bool sync = cc->mode != MIGRATE_ASYNC;
>  	bool update_cached;
>  	unsigned int nr_succeeded = 0;
> +	int order;
>  
>  	/*
>  	 * These counters track activities during zone compaction.  Initialize
> @@ -2421,7 +2487,10 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  	cc->total_free_scanned = 0;
>  	cc->nr_migratepages = 0;
>  	cc->nr_freepages = 0;
> -	INIT_LIST_HEAD(&cc->freepages);
> +	for (order = 0; order <= MAX_ORDER; order++) {
> +		INIT_LIST_HEAD(&cc->freepages[order].pages);
> +		cc->freepages[order].nr_free = 0;
> +	}
>  	INIT_LIST_HEAD(&cc->migratepages);
>  
>  	cc->migratetype = gfp_migratetype(cc->gfp_mask);
> @@ -2607,7 +2676,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  	 * so we don't leave any returned pages behind in the next attempt.
>  	 */
>  	if (cc->nr_freepages > 0) {
> -		unsigned long free_pfn = release_freepages(&cc->freepages);
> +		unsigned long free_pfn = release_free_list(cc->freepages);
>  
>  		cc->nr_freepages = 0;
>  		VM_BUG_ON(free_pfn == 0);
> @@ -2626,7 +2695,6 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  
>  	trace_mm_compaction_end(cc, start_pfn, end_pfn, sync, ret);
>  
> -	VM_BUG_ON(!list_empty(&cc->freepages));
>  	VM_BUG_ON(!list_empty(&cc->migratepages));
>  
>  	return ret;
> diff --git a/mm/internal.h b/mm/internal.h
> index 8c90e966e9f8..f5c691bb5c1c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -465,6 +465,11 @@ int split_free_page(struct page *free_page,
>  /*
>   * in mm/compaction.c
>   */
> +
> +struct free_list {
> +	struct list_head	pages;
> +	unsigned long		nr_free;

Do we really need nr_free?  Is it enough just to use
list_empty(&free_list->pages)?
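
For example (untested, assuming nr_free is dropped and freepages stays an
array of struct free_list), compaction_alloc() could simply do:

	if (list_empty(&cc->freepages[order].pages)) {
		isolate_freepages(cc);
		if (list_empty(&cc->freepages[order].pages))
			return NULL;
	}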

> +};
>  /*
>   * compact_control is used to track pages being migrated and the free pages
>   * they are being migrated to during memory compaction. The free_pfn starts
> @@ -473,7 +478,7 @@ int split_free_page(struct page *free_page,
>   * completes when free_pfn <= migrate_pfn
>   */
>  struct compact_control {
> -	struct list_head freepages;	/* List of free pages to migrate to */
> +	struct free_list freepages[MAX_ORDER + 1];	/* List of free pages to migrate to */
>  	struct list_head migratepages;	/* List of pages being migrated */
>  	unsigned int nr_freepages;	/* Number of isolated free pages */
>  	unsigned int nr_migratepages;	/* Number of pages to migrate */

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-09 15:52       ` Zi Yan
@ 2023-10-10 10:00         ` Ryan Roberts
  0 siblings, 0 replies; 32+ messages in thread
From: Ryan Roberts @ 2023-10-10 10:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Yin, Fengwei, Yu Zhao, Vlastimil Babka,
	Johannes Weiner, Baolin Wang, Kemeng Shi, Mel Gorman, Rohan Puri,
	Mcgrof Chamberlain, Adam Manzanares, John Hubbard

On 09/10/2023 16:52, Zi Yan wrote:
> (resent as plain text)
> On 9 Oct 2023, at 10:10, Ryan Roberts wrote:
> 
>> On 09/10/2023 14:24, Zi Yan wrote:
>>> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>>>
>>>> Hi Zi,
>>>>
>>>> On 12/09/2023 17:28, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>>> the prerequisitions for large folio support[1]. It is on top of
>>>>> mm-everything-2023-09-11-22-56.
>>>>
>>>> I've taken a quick look at these and realize I'm not well equipped to provide
>>>> much in the way of meaningful review comments; All I can say is thanks for
>>>> putting this together, and yes, I think it will become even more important for
>>>> my work on anonymous large folios.
>>>>
>>>>
>>>>>
>>>>> Overview
>>>>> ===
>>>>>
>>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>>> for migration are kept during compaction. Free pages used to be split into
>>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>>> on their order without post allocation process. When migrate_pages() asks for
>>>>> a new page, one of the free pages, based on the requested page order, is
>>>>> then processed and given out.
>>>>>
>>>>>
>>>>> Optimizations
>>>>> ===
>>>>>
>>>>> 1. Free page split is added to increase migration success rate in case
>>>>> a source page does not have a matched free page in the free page lists.
>>>>> Free page merge is possible but not implemented, since existing
>>>>> PFN-based buddy page merge algorithm requires the identification of
>>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>>
>>>>> 2. Sort source pages in ascending order before migration is added to
>>>>> reduce free page split. Otherwise, high order free pages might be
>>>>> prematurely split, causing undesired high order folio migration failures.
>>>>
>>>> Not knowing much about how compaction actually works, naively I would imagine
>>>> that if you are just trying to free up a known amount of contiguous physical
>>>> space, then working through the pages in PFN order is more likely to yield the
>>>> result quicker? Unless all of the pages in the set must be successfully migrated
>>>> in order to free up the required amount of space...
>>>
>>> During compaction, pages are not freed, since that is the job of page reclaim.
>>
>> Sorry yes - my fault for using sloppy language. When I said "free up a known
>> amount of contiguous physical space", I really meant "move pages in order to
>> recover an amount of contiguous physical space". But I still think the rest of
>> what I said applies; wouldn't you be more likely to reach your goal quicker if
>> you sort by PFN?
> 
> Not always. Suppose the in-use folios on the left are order-2, order-2, order-4
> (all contiguous in one pageblock) and the free pages on the right are order-4
> (pageblock N) plus order-2, order-2 (pageblock N-1); they do not form a single
> order-8 free page, since there are in-use folios in the middle. Going in PFN
> order will not get you an order-8 free page, because the first order-4 free page
> will be split into two order-2 pages for the first two order-2 in-use folios.
> But if you migrate in descending order of in-use folio orders, you can get an
> order-8 free page at the end.
> 
> The patchset minimizes free page splits to avoid the situation described above:
> once a high order free page is split, the opportunity to migrate a high order
> in-use folio into it is gone and is hard to recover.

OK I get it now - thanks!

> 
> 
>>> The goal of compaction is to get a high order free page without freeing existing
>>> pages to avoid potential high cost IO operations. If compaction does not work,
>>> page reclaim would free pages to get us there (and potentially another follow-up
>>> compaction). So either pages are migrated or stay where they are during compaction.
>>>
>>> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
>>> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
> 
> 
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction
  2023-10-10  6:08     ` Huang, Ying
@ 2023-10-10 16:48       ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2023-10-10 16:48 UTC (permalink / raw)
  To: "Huang, Ying"
  Cc: linux-mm, linux-kernel, Ryan Roberts, Andrew Morton,
	"Matthew Wilcox (Oracle)",
	David Hildenbrand, "Yin, Fengwei",
	Yu Zhao, Vlastimil Babka, Johannes Weiner, Baolin Wang,
	Kemeng Shi, Mel Gorman, Rohan Puri, Mcgrof Chamberlain,
	Adam Manzanares, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 6808 bytes --]

On 10 Oct 2023, at 2:08, Huang, Ying wrote:

> Something went wrong with my mailbox.  Sorry if you received duplicated
> mail.
>
> Zi Yan <ziy@nvidia.com> writes:
>
>> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>>
>>> Hi, Zi,
>>>
>>> Thanks for your patch!
>>>
>>> Zi Yan <zi.yan@sent.com> writes:
>>>
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>> the prerequisitions for large folio support[1]. It is on top of
>>>> mm-everything-2023-09-11-22-56.
>>>>
>>>> Overview
>>>> ===
>>>>
>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>> for migration are kept during compaction.
>>>
>>> migrate_pages() can split the large folio for allocation failure.  So
>>> the minimal implementation could be
>>>
>>> - allow to migrate large folios in compaction
>>> - return -ENOMEM for order > 0 in compaction_alloc()
>>>
>>> The performance may not be desirable.  But that may be a baseline for
>>> further optimization.
>>
>> I would imagine it might cause a regression since compaction might gradually
>> split high order folios in the system.
>
> I would not call it a pure regression, since large folios can be migrated
> during compaction with that, but it's possible that this hurts
> performance.
>
> Anyway, this can be a not-so-good minimal baseline.
>
>> But I can move Patch 4 first to make this the baseline and see how
>> system performance changes.
>
> Thanks!
>
>>>
>>> And, if we can measure the performance for each step of optimization,
>>> that will be even better.
>>
>> Do you have any benchmark in mind for the performance tests? vm-scalability?
>
> I remember Mel Gorman has done some tests for defragmentation before.
> But that's for order-0 pages.

OK, I will try to find that.

>
>>>> Free pages used to be split into
>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>> on their order without post allocation process. When migrate_pages() asks for
>>>> a new page, one of the free pages, based on the requested page order, is
>>>> then processed and given out.
>>>>
>>>>
>>>> Optimizations
>>>> ===
>>>>
>>>> 1. Free page split is added to increase migration success rate in case
>>>> a source page does not have a matched free page in the free page lists.
>>>> Free page merge is possible but not implemented, since existing
>>>> PFN-based buddy page merge algorithm requires the identification of
>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>
>>>> 2. Sort source pages in ascending order before migration is added to
>>>
>>> Trivial.
>>>
>>> s/ascending/descending/
>>>
>>>> reduce free page split. Otherwise, high order free pages might be
>>>> prematurely split, causing undesired high order folio migration failures.
>>>>
>>>>
>>>> TODOs
>>>> ===
>>>>
>>>> 1. Refactor free page post allocation and free page preparation code so
>>>> that compaction_alloc() and compaction_free() can call functions instead
>>>> of hard coding.
>>>>
>>>> 2. One possible optimization is to allow migrate_pages() to continue
>>>> even if get_new_folio() returns a NULL. In general, that means there is
>>>> not enough memory. But in >0 order folio compaction case, that means
>>>> there is no suitable free page at source page order. It might be better
>>>> to skip that page and finish the rest of migration to achieve a better
>>>> compaction result.
>>>
>>> We can split the source folio if get_new_folio() returns NULL.  So, do
>>> we really need this?
>>
>> It depends. The situation it can benefit is when the system tries to
>> allocate a high order free page and triggers a compaction: it is possible to
>> get the high order free page by migrating a bunch of base pages instead of
>> splitting an existing high order folio.
>>
>>>
>>> In general, we may reconsider all further optimizations given splitting
>>> is available already.
>>
>> In my mind, split should be avoided as much as possible.
>
> If so, should we use "nosplit" logic in migrate_pages_batch() in some
> situations?

A possible future optimization.
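
(If we go that way, IIRC migrate_pages_batch() already computes something like

	bool nosplit = (reason == MR_NUMA_MISPLACED);

so, hypothetically, a compaction opt-out could extend that check with
"reason == MR_COMPACTION && <some new knob>". Just thinking out loud here; the
knob and the policy for when compaction should refuse to split are completely
made up and untested.)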

>
>> But it really depends
>> on the actual situation, e.g., how much effort and cost the compaction wants
>> to pay to get memory defragmented. If the system really wants to get a high
>> order free page at any cost, split can be used without any issue. But applications
>> might lose performance because existing large folios are split just to
>> form a new one.
>
> Is it possible that splitting is desirable in some situations?  For
> example, allocating some large DMA buffers at the cost of large anonymous
> folios?

Sure. There are definitely cases where splitting is better than not splitting.
But let's leave it until large anonymous folios are deployed.

>
>> Like I said in the email, there are tons of optimizations and policies for us
>> to explore. We can start with the bare minimum support (if no performance
>> regression is observed, we can even start with splitting all high order
>> folios like you suggested) and add optimizations one by one.
>
> Sounds good to me!  Thanks!
>
>>>
>>>> 3. Another possible optimization is to enable free page merge. It is
>>>> possible that a to-be-migrated page causes free page split then fails to
>>>> migrate eventually. We would lose a high order free page without free
>>>> page merge function. But a way of identifying free pages for memory
>>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>>
>>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>>> and does not consider all possible situations. A better algorithm can
>>>> improve compaction success rate.
>>>>
>>>>
>>>> Feel free to give comments and ask questions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>>>
>>>> Zi Yan (4):
>>>>   mm/compaction: add support for >0 order folio memory compaction.
>>>>   mm/compaction: optimize >0 order folio compaction with free page
>>>>     split.
>>>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>>>     pages.
>>>>   mm/compaction: enable compacting >0 order folios.
>>>>
>>>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>>>  mm/internal.h   |   7 +-
>>>>  2 files changed, 176 insertions(+), 36 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-10-10 16:48 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-12 16:28 [RFC PATCH 0/4] Enable >0 order folio memory compaction Zi Yan
2023-09-12 16:28 ` [RFC PATCH 1/4] mm/compaction: add support for " Zi Yan
2023-09-12 17:32   ` Johannes Weiner
2023-09-12 17:38     ` Zi Yan
2023-09-15  9:33   ` Baolin Wang
2023-09-18 17:06     ` Zi Yan
2023-10-10  8:07   ` Huang, Ying
2023-09-12 16:28 ` [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split Zi Yan
2023-09-18  7:34   ` Baolin Wang
2023-09-18 17:20     ` Zi Yan
2023-09-20  8:15       ` Baolin Wang
2023-09-12 16:28 ` [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages Zi Yan
2023-09-12 17:56   ` Johannes Weiner
2023-09-12 20:31     ` Zi Yan
2023-09-12 16:28 ` [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios Zi Yan
2023-09-15  9:41   ` Baolin Wang
2023-09-18 17:17     ` Zi Yan
2023-09-20 14:44   ` kernel test robot
2023-09-21  0:55 ` [RFC PATCH 0/4] Enable >0 order folio memory compaction Luis Chamberlain
2023-09-21  1:16   ` Luis Chamberlain
2023-09-21  2:05     ` John Hubbard
2023-09-21  3:14       ` Luis Chamberlain
2023-09-21 15:56         ` Zi Yan
2023-10-02 12:32 ` Ryan Roberts
2023-10-09 13:24   ` Zi Yan
2023-10-09 14:10     ` Ryan Roberts
2023-10-09 15:52       ` Zi Yan
2023-10-10 10:00         ` Ryan Roberts
2023-10-09  7:12 ` Huang, Ying
2023-10-09 13:43   ` Zi Yan
2023-10-10  6:08     ` Huang, Ying
2023-10-10 16:48       ` Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).