* [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
@ 2023-09-11 19:41 Johannes Weiner
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
                   ` (6 more replies)
  0 siblings, 7 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

V2:
- dropped the get_pfnblock_migratetype() optimization
  patchlet since somebody else beat me to it (thanks Zi)
- broke out pcp bypass fix since somebody else reported the bug:
  https://lore.kernel.org/linux-mm/20230911181108.GA104295@cmpxchg.org/
- fixed the CONFIG_UNACCEPTED_MEMORY build (lkp)
- rebased to v6.6-rc1

The series is based on v6.6-rc1 plus the pcp bypass fix above ^

---

This is a breakout series from the huge page allocator patches[1].

While testing and benchmarking the series incrementally, as per
reviewer request, it became apparent that there are several sources of
freelist migratetype violations that later patches in the series hid.

Those violations occur when pages of one migratetype end up on the
freelists of another type. This encourages incompatible page mixing
down the line, where allocation requests ask for one migratetype but
receive pages of another. This defeats the mobility grouping.

The series addresses those causes. The last patch adds type checks on
all freelist movements to rule out any violations. I used these checks
to identify the violations fixed up in the preceding patches.
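
The invariant those checks enforce can be sketched in plain C. This is
a toy model only: the type names and freelist_move_ok() are made up for
illustration, while the real checks operate on struct page, zone
freelists and get_pfnblock_migratetype().

```c
#include <assert.h>

/*
 * Minimal model of freelist type hygiene: a page may only be placed
 * on the freelist whose migratetype matches its pageblock's type.
 * All names here are illustrative, not the kernel's API.
 */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
};

struct pageblock {
	enum migratetype type;	/* authoritative type of the block */
};

/* Would putting this block's pages on freelist 'mt' keep the lists clean? */
static int freelist_move_ok(const struct pageblock *block, enum migratetype mt)
{
	return block->type == mt;
}
```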

The series is a breakout, but has merit on its own: Less type mixing
means improved grouping, means less work for compaction, means higher
THP success rate and lower allocation latencies. The results can be
seen in a mixed workload that stresses the machine with a kernel build
job while periodically attempting to allocate batches of THP. The data
is averaged over 50 consecutive defconfig builds:

                                                        VANILLA      PATCHED-CLEANLISTS
Hugealloc Time median                     14642.00 (    +0.00%)   10506.00 (   -28.25%)
Hugealloc Time min                         4820.00 (    +0.00%)    4783.00 (    -0.77%)
Hugealloc Time max                      6786868.00 (    +0.00%) 6556624.00 (    -3.39%)
Kbuild Real time                            240.03 (    +0.00%)     241.45 (    +0.59%)
Kbuild User time                           1195.49 (    +0.00%)    1195.69 (    +0.02%)
Kbuild System time                           96.44 (    +0.00%)      97.03 (    +0.61%)
THP fault alloc                           11490.00 (    +0.00%)   11802.30 (    +2.72%)
THP fault fallback                          782.62 (    +0.00%)     478.88 (   -38.76%)
THP fault fail rate %                         6.38 (    +0.00%)       3.90 (   -33.52%)
Direct compact stall                        297.70 (    +0.00%)     224.56 (   -24.49%)
Direct compact fail                         265.98 (    +0.00%)     191.56 (   -27.87%)
Direct compact success                       31.72 (    +0.00%)      33.00 (    +3.91%)
Direct compact success rate %                13.11 (    +0.00%)      17.26 (   +29.43%)
Compact daemon scanned migrate          1673661.58 (    +0.00%) 1591682.18 (    -4.90%)
Compact daemon scanned free             2711252.80 (    +0.00%) 2615217.78 (    -3.54%)
Compact direct scanned migrate           384998.62 (    +0.00%)  261689.42 (   -32.03%)
Compact direct scanned free              966308.94 (    +0.00%)  667459.76 (   -30.93%)
Compact migrate scanned daemon %             80.86 (    +0.00%)      83.34 (    +3.02%)
Compact free scanned daemon %                74.41 (    +0.00%)      78.26 (    +5.10%)
Alloc stall                                 338.06 (    +0.00%)     440.72 (   +30.28%)
Pages kswapd scanned                    1356339.42 (    +0.00%) 1402313.42 (    +3.39%)
Pages kswapd reclaimed                   581309.08 (    +0.00%)  587956.82 (    +1.14%)
Pages direct scanned                      56384.18 (    +0.00%)  141095.04 (  +150.24%)
Pages direct reclaimed                    17055.54 (    +0.00%)   22427.96 (   +31.50%)
Pages scanned kswapd %                       96.38 (    +0.00%)      93.60 (    -2.86%)
Swap out                                  41528.00 (    +0.00%)   47969.92 (   +15.51%)
Swap in                                    6541.42 (    +0.00%)    9093.30 (   +39.01%)
File refaults                            127666.50 (    +0.00%)  135766.84 (    +6.34%)

 include/linux/mm.h             |  18 +-
 include/linux/page-isolation.h |   2 +-
 include/linux/vmstat.h         |   8 -
 mm/debug_page_alloc.c          |  12 +-
 mm/internal.h                  |   5 -
 mm/page_alloc.c                | 357 ++++++++++++++++++---------------
 mm/page_isolation.c            |  23 ++-
 7 files changed, 217 insertions(+), 208 deletions(-)



^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
@ 2023-09-11 19:41 ` Johannes Weiner
  2023-09-11 19:59   ` Zi Yan
                     ` (3 more replies)
  2023-09-11 19:41 ` [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks Johannes Weiner
                   ` (5 subsequent siblings)
  6 siblings, 4 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

The idea behind the cache is to save get_pageblock_migratetype()
lookups during bulk freeing. A microbenchmark suggests this isn't
helping, though. The pcp migratetype can get stale, which means that
bulk freeing has an extra branch to check if the pageblock was
isolated while on the pcp.

While the variance overlaps, the cache write and the branch seem to
make this a net negative. The following test allocates and frees
batches of 10,000 pages (~3x the pcp high marks to trigger flushing):

Before:
          8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
                19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
                 0      cpu-migrations                   #    0.000 /sec
            17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
    41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
   126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
    25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
        33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )

         0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )

After:
          8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
                22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
                 0      cpu-migrations                   #    0.000 /sec
            17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
    40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
   126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
    25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
        32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )

         0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )

A side effect is that this also ensures that pages whose pageblock
gets stolen while on the pcplist end up on the right freelist and we
don't perform potentially type-incompatible buddy merges (or skip
merges when we shouldn't), which is likely beneficial to long-term
fragmentation management, although the effects would be harder to
measure. Settle for simpler and faster code as justification here.
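
The staleness problem being removed can be modelled in a few lines of
toy C. The struct and function names are invented for the sketch; the
real cache lived in page->index and went stale when the pageblock was
isolated while the page sat on the pcplist.

```c
#include <assert.h>

/*
 * Toy model of the removed cache: the pcp entry snapshots the block
 * type at free time, but the block can be re-typed while the page
 * sits on the pcplist, so the snapshot can go stale.
 */
struct toy_block { int type; };
struct toy_pcp_page { int cached_type; struct toy_block *block; };

static void toy_free_to_pcp(struct toy_pcp_page *p, struct toy_block *b)
{
	p->block = b;
	p->cached_type = b->type;	/* snapshot, may go stale */
}

/* What the patch does instead: consult the block itself at flush time. */
static int toy_flush_type(const struct toy_pcp_page *p)
{
	return p->block->type;
}
```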

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 61 ++++++++++++-------------------------------------
 1 file changed, 14 insertions(+), 47 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..e3f1c777feed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -204,24 +204,6 @@ EXPORT_SYMBOL(node_states);
 
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
-/*
- * A cached value of the page's pageblock's migratetype, used when the page is
- * put on a pcplist. Used to avoid the pageblock migratetype lookup when
- * freeing from pcplists in most cases, at the cost of possibly becoming stale.
- * Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
- */
-static inline int get_pcppage_migratetype(struct page *page)
-{
-	return page->index;
-}
-
-static inline void set_pcppage_migratetype(struct page *page, int migratetype)
-{
-	page->index = migratetype;
-}
-
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 unsigned int pageblock_order __read_mostly;
 #endif
@@ -1186,7 +1168,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	unsigned long flags;
 	unsigned int order;
-	bool isolated_pageblocks;
 	struct page *page;
 
 	/*
@@ -1199,7 +1180,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	pindex = pindex - 1;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	while (count > 0) {
 		struct list_head *list;
@@ -1215,10 +1195,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		order = pindex_to_order(pindex);
 		nr_pages = 1 << order;
 		do {
+			unsigned long pfn;
 			int mt;
 
 			page = list_last_entry(list, struct page, pcp_list);
-			mt = get_pcppage_migratetype(page);
+			pfn = page_to_pfn(page);
+			mt = get_pfnblock_migratetype(page, pfn);
 
 			/* must delete to avoid corrupting pcp list */
 			list_del(&page->pcp_list);
@@ -1227,11 +1209,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 			/* MIGRATE_ISOLATE page should not go to pcplists */
 			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(isolated_pageblocks))
-				mt = get_pageblock_migratetype(page);
 
-			__free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE);
+			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
 			trace_mm_page_pcpu_drain(page, order, mt);
 		} while (count > 0 && !list_empty(list));
 	}
@@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			continue;
 		del_page_from_free_list(page, zone, current_order);
 		expand(zone, page, order, current_order, migratetype);
-		set_pcppage_migratetype(page, migratetype);
 		trace_mm_page_alloc_zone_locked(page, order, migratetype,
 				pcp_allowed_order(order) &&
 				migratetype < MIGRATE_PCPTYPES);
@@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
-		if (is_migrate_cma(get_pcppage_migratetype(page)))
+		if (is_migrate_cma(get_pageblock_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -2304,19 +2282,6 @@ void drain_all_pages(struct zone *zone)
 	__drain_all_pages(zone, false);
 }
 
-static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
-							unsigned int order)
-{
-	int migratetype;
-
-	if (!free_pages_prepare(page, order, FPI_NONE))
-		return false;
-
-	migratetype = get_pfnblock_migratetype(page, pfn);
-	set_pcppage_migratetype(page, migratetype);
-	return true;
-}
-
 static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 {
 	int min_nr_free, max_nr_free;
@@ -2402,7 +2367,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype, pcpmigratetype;
 
-	if (!free_unref_page_prepare(page, pfn, order))
+	if (!free_pages_prepare(page, order, FPI_NONE))
 		return;
 
 	/*
@@ -2412,7 +2377,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	 * get those areas back if necessary. Otherwise, we may have to free
 	 * excessively into the page allocator
 	 */
-	migratetype = pcpmigratetype = get_pcppage_migratetype(page);
+	migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
@@ -2448,7 +2413,8 @@ void free_unref_page_list(struct list_head *list)
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
-		if (!free_unref_page_prepare(page, pfn, 0)) {
+
+		if (!free_pages_prepare(page, 0, FPI_NONE)) {
 			list_del(&page->lru);
 			continue;
 		}
@@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
 		 * Free isolated pages directly to the allocator, see
 		 * comment in free_unref_page.
 		 */
-		migratetype = get_pcppage_migratetype(page);
+		migratetype = get_pfnblock_migratetype(page, pfn);
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			list_del(&page->lru);
 			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
@@ -2466,10 +2432,11 @@ void free_unref_page_list(struct list_head *list)
 	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
+		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
 
 		list_del(&page->lru);
-		migratetype = get_pcppage_migratetype(page);
+		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
 		 * Either different zone requiring a different pcp lock or
@@ -2492,7 +2459,7 @@ void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, page_to_pfn(page),
+				free_one_page(zone, page, pfn,
 					      0, migratetype, FPI_NONE);
 				locked_zone = NULL;
 				continue;
@@ -2661,7 +2628,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			}
 		}
 		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
+					  get_pageblock_migratetype(page));
 		spin_unlock_irqrestore(&zone->lock, flags);
 	} while (check_new_pages(page, order));
 
-- 
2.42.0




* [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
@ 2023-09-11 19:41 ` Johannes Weiner
  2023-09-11 20:01   ` Zi Yan
                     ` (2 more replies)
  2023-09-11 19:41 ` [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation Johannes Weiner
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

The buddy allocator coalesces compatible blocks during freeing, but it
doesn't update the types of the subblocks to match. When an allocation
later breaks the chunk down again, its pieces will be put on freelists
of the wrong type. This encourages incompatible page mixing (ask for
one type, get another), and thus long-term fragmentation.

Update the subblocks when merging a larger chunk, such that a later
expand() will maintain freelist type hygiene.
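
The shape of the fix, reduced to a toy model (struct blk, mergeable()
and try_merge() are invented names; the real code checks
migratetype_is_mergeable() and calls set_pageblock_migratetype() on the
buddy):

```c
#include <assert.h>

/*
 * Sketch of the merge-time rule: before coalescing a buddy of a
 * different but compatible type, re-type the buddy's block so that a
 * later split puts its sub-blocks back on matching freelists.
 */
struct blk { int type; };

/* Stand-in for migratetype_is_mergeable(); -1 models an unmergeable type. */
static int mergeable(int type) { return type != -1; }

/* Returns 1 if the merge proceeds; on type mismatch, the buddy is re-typed. */
static int try_merge(struct blk *page_blk, struct blk *buddy_blk)
{
	if (page_blk->type != buddy_blk->type) {
		if (!mergeable(page_blk->type) || !mergeable(buddy_blk->type))
			return 0;	/* done_merging */
		buddy_blk->type = page_blk->type;	/* match buddy type */
	}
	return 1;
}
```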

v2:
- remove spurious change_pageblock_range() move (Zi Yan)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..3db405414174 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -783,10 +783,17 @@ static inline void __free_one_page(struct page *page,
 			 */
 			int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
 
-			if (migratetype != buddy_mt
-					&& (!migratetype_is_mergeable(migratetype) ||
-						!migratetype_is_mergeable(buddy_mt)))
-				goto done_merging;
+			if (migratetype != buddy_mt) {
+				if (!migratetype_is_mergeable(migratetype) ||
+				    !migratetype_is_mergeable(buddy_mt))
+					goto done_merging;
+				/*
+				 * Match buddy type. This ensures that
+				 * an expand() down the line puts the
+				 * sub-blocks on the right freelists.
+				 */
+				set_pageblock_migratetype(buddy, migratetype);
+			}
 		}
 
 		/*
-- 
2.42.0




* [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
  2023-09-11 19:41 ` [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks Johannes Weiner
@ 2023-09-11 19:41 ` Johannes Weiner
  2023-09-11 20:17   ` Zi Yan
                     ` (2 more replies)
  2023-09-11 19:41 ` [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error Johannes Weiner
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

When claiming a block during compaction isolation, move any remaining
free pages to the correct freelists as well, instead of stranding them
on the wrong list. Otherwise, this encourages incompatible page mixing
down the line, and thus long-term fragmentation.
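
The rule this patch enforces, as a toy model (all names invented;
the real code pairs set_pageblock_migratetype() with
move_freepages_block()):

```c
#include <assert.h>

/*
 * When a block's migratetype changes, its free pages must move to the
 * matching freelist in the same step, so type and list never diverge.
 */
enum { T_UNMOVABLE, T_MOVABLE, T_NR };

struct toy_zone {
	int freelist[T_NR];	/* free pages per migratetype */
};

struct toy_blk {
	int type;
	int nr_free;	/* free pages currently in this block */
};

/* Re-type the block AND move its free pages, as one operation. */
static void toy_convert_block(struct toy_zone *z, struct toy_blk *b, int to)
{
	z->freelist[b->type] -= b->nr_free;
	z->freelist[to] += b->nr_free;
	b->type = to;
}
```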

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3db405414174..f6f658c3d394 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2548,9 +2548,12 @@ int __isolate_free_page(struct page *page, unsigned int order)
 			 * Only change normal pageblocks (i.e., they can merge
 			 * with others)
 			 */
-			if (migratetype_is_mergeable(mt))
+			if (migratetype_is_mergeable(mt)) {
 				set_pageblock_migratetype(page,
 							  MIGRATE_MOVABLE);
+				move_freepages_block(zone, page,
+						     MIGRATE_MOVABLE, NULL);
+			}
 		}
 	}
 
-- 
2.42.0




* [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
                   ` (2 preceding siblings ...)
  2023-09-11 19:41 ` [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation Johannes Weiner
@ 2023-09-11 19:41 ` Johannes Weiner
  2023-09-11 20:23   ` Zi Yan
                     ` (2 more replies)
  2023-09-11 19:41 ` [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion Johannes Weiner
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

When a block is partially outside the zone of the cursor page, the
function cuts the range to the pivot page instead of the zone
start. This can leave large parts of the block behind, which
encourages incompatible page mixing down the line (ask for one type,
get another), and thus long-term fragmentation.

This triggers reliably on the first block in the DMA zone, whose
start_pfn is 1. The block is stolen, but everything before the pivot
page (which was often hundreds of pages) is left on the old list.
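
The clamp in isolation, with made-up function and parameter names
(the real code works on pageblock_start_pfn() and zone->zone_start_pfn):

```c
#include <assert.h>

/*
 * When the block starts before the zone, the move range must be cut at
 * the zone start, not at the pivot pfn. 'buggy' selects the old
 * behaviour for comparison.
 */
static unsigned long block_move_start(unsigned long block_start_pfn,
				      unsigned long pivot_pfn,
				      unsigned long zone_start_pfn,
				      int buggy)
{
	if (block_start_pfn >= zone_start_pfn)
		return block_start_pfn;	/* block fully inside the zone */
	/* Block straddles the zone start: */
	return buggy ? pivot_pfn : zone_start_pfn;
}
```

For the DMA-zone case from the changelog (block start 0, zone start 1,
pivot pfn somewhere in the hundreds), the old behaviour strands every
free page between the zone start and the pivot on the old list.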

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6f658c3d394..5bbe5f3be5ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1652,7 +1652,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
 
 	/* Do not cross zone boundaries */
 	if (!zone_spans_pfn(zone, start_pfn))
-		start_pfn = pfn;
+		start_pfn = zone->zone_start_pfn;
 	if (!zone_spans_pfn(zone, end_pfn))
 		return 0;
 
-- 
2.42.0




* [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
                   ` (3 preceding siblings ...)
  2023-09-11 19:41 ` [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error Johannes Weiner
@ 2023-09-11 19:41 ` Johannes Weiner
  2023-09-13 19:52   ` Vlastimil Babka
  2023-09-11 19:41 ` [PATCH 6/6] mm: page_alloc: consolidate free page accounting Johannes Weiner
  2023-09-14 23:52 ` [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Mike Kravetz
  6 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

Currently, page block type conversion during fallbacks, atomic
reservations and isolation can strand various amounts of free pages on
incorrect freelists.

For example, fallback stealing moves free pages in the block to the
new type's freelists, but then may not actually claim the block for
that type if there aren't enough compatible pages already allocated.

In all cases, free page moving might fail if the block straddles more
than one zone, in which case no free pages are moved at all, but the
block type is changed anyway.

This is detrimental to type hygiene on the freelists. It encourages
incompatible page mixing down the line (ask for one type, get another)
and thus contributes to long-term fragmentation.

Split the process into a proper transaction: check first if conversion
will happen, then try to move the free pages, and only if that was
successful convert the block to the new type.
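
The shape of that transaction, with toy types (txn_* names are
invented; in the real code step 1 is prep_move_freepages_block(),
step 2 is move_freepages(), step 3 is set_pageblock_migratetype()):

```c
#include <assert.h>

/*
 * 1) check that the move can succeed, 2) move the free pages,
 * 3) only then commit the new block type. No step mutates state
 * before the preceding step has succeeded.
 */
struct txn_blk { int type; int in_zone; int nr_free; };
struct txn_zone { int freelist[2]; };

static int txn_claim_block(struct txn_zone *z, struct txn_blk *b, int to)
{
	/* 1) prep: fails on zone-boundary conditions, nothing changed */
	if (!b->in_zone)
		return -1;
	/* 2) move the block's free pages to the target freelist */
	z->freelist[b->type] -= b->nr_free;
	z->freelist[to] += b->nr_free;
	/* 3) commit the type change last */
	b->type = to;
	return b->nr_free;
}
```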

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page-isolation.h |   3 +-
 mm/page_alloc.c                | 171 ++++++++++++++++++++-------------
 mm/page_isolation.c            |  22 +++--
 3 files changed, 118 insertions(+), 78 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 4ac34392823a..8550b3c91480 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,8 +34,7 @@ static inline bool is_migrate_isolate(int migratetype)
 #define REPORT_FAILURE	0x2
 
 void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page,
-				int migratetype, int *num_movable);
+int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5bbe5f3be5ad..a902593f16dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1601,9 +1601,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
  * Note that start_page and end_pages are not aligned on a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
-static int move_freepages(struct zone *zone,
-			  unsigned long start_pfn, unsigned long end_pfn,
-			  int migratetype, int *num_movable)
+static int move_freepages(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int migratetype)
 {
 	struct page *page;
 	unsigned long pfn;
@@ -1613,14 +1612,6 @@ static int move_freepages(struct zone *zone,
 	for (pfn = start_pfn; pfn <= end_pfn;) {
 		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
-			/*
-			 * We assume that pages that could be isolated for
-			 * migration are movable. But we don't actually try
-			 * isolating, as that would be expensive.
-			 */
-			if (num_movable &&
-					(PageLRU(page) || __PageMovable(page)))
-				(*num_movable)++;
 			pfn++;
 			continue;
 		}
@@ -1638,26 +1629,62 @@ static int move_freepages(struct zone *zone,
 	return pages_moved;
 }
 
-int move_freepages_block(struct zone *zone, struct page *page,
-				int migratetype, int *num_movable)
+static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+				      unsigned long *start_pfn,
+				      unsigned long *end_pfn,
+				      int *num_free, int *num_movable)
 {
-	unsigned long start_pfn, end_pfn, pfn;
-
-	if (num_movable)
-		*num_movable = 0;
+	unsigned long pfn, start, end;
 
 	pfn = page_to_pfn(page);
-	start_pfn = pageblock_start_pfn(pfn);
-	end_pfn = pageblock_end_pfn(pfn) - 1;
+	start = pageblock_start_pfn(pfn);
+	end = pageblock_end_pfn(pfn) - 1;
 
 	/* Do not cross zone boundaries */
-	if (!zone_spans_pfn(zone, start_pfn))
-		start_pfn = zone->zone_start_pfn;
-	if (!zone_spans_pfn(zone, end_pfn))
-		return 0;
+	if (!zone_spans_pfn(zone, start))
+		start = zone->zone_start_pfn;
+	if (!zone_spans_pfn(zone, end))
+		return false;
+
+	*start_pfn = start;
+	*end_pfn = end;
+
+	if (num_free) {
+		*num_free = 0;
+		*num_movable = 0;
+		for (pfn = start; pfn <= end;) {
+			page = pfn_to_page(pfn);
+			if (PageBuddy(page)) {
+				int nr = 1 << buddy_order(page);
+
+				*num_free += nr;
+				pfn += nr;
+				continue;
+			}
+			/*
+			 * We assume that pages that could be isolated for
+			 * migration are movable. But we don't actually try
+			 * isolating, as that would be expensive.
+			 */
+			if (PageLRU(page) || __PageMovable(page))
+				(*num_movable)++;
+			pfn++;
+		}
+	}
 
-	return move_freepages(zone, start_pfn, end_pfn, migratetype,
-								num_movable);
+	return true;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page,
+			 int migratetype)
+{
+	unsigned long start_pfn, end_pfn;
+
+	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+				       NULL, NULL))
+		return -1;
+
+	return move_freepages(zone, start_pfn, end_pfn, migratetype);
 }
 
 static void change_pageblock_range(struct page *pageblock_page,
@@ -1742,33 +1769,36 @@ static inline bool boost_watermark(struct zone *zone)
 }
 
 /*
- * This function implements actual steal behaviour. If order is large enough,
- * we can steal whole pageblock. If not, we first move freepages in this
- * pageblock to our migratetype and determine how many already-allocated pages
- * are there in the pageblock with a compatible migratetype. If at least half
- * of pages are free or compatible, we can change migratetype of the pageblock
- * itself, so pages freed in the future will be put on the correct free list.
+ * This function implements actual steal behaviour. If order is large enough, we
+ * can claim the whole pageblock for the requested migratetype. If not, we check
+ * the pageblock for constituent pages; if at least half of the pages are free
+ * or compatible, we can still claim the whole block, so pages freed in the
+ * future will be put on the correct free list. Otherwise, we isolate exactly
+ * the order we need from the fallback block and leave its migratetype alone.
  */
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
-		unsigned int alloc_flags, int start_type, bool whole_block)
+				    int current_order, int order, int start_type,
+				    unsigned int alloc_flags, bool whole_block)
 {
-	unsigned int current_order = buddy_order(page);
 	int free_pages, movable_pages, alike_pages;
-	int old_block_type;
+	unsigned long start_pfn, end_pfn;
+	int block_type;
 
-	old_block_type = get_pageblock_migratetype(page);
+	block_type = get_pageblock_migratetype(page);
 
 	/*
 	 * This can happen due to races and we want to prevent broken
 	 * highatomic accounting.
 	 */
-	if (is_migrate_highatomic(old_block_type))
+	if (is_migrate_highatomic(block_type))
 		goto single_page;
 
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
+		del_page_from_free_list(page, zone, current_order);
 		change_pageblock_range(page, current_order, start_type);
-		goto single_page;
+		expand(zone, page, order, current_order, start_type);
+		return;
 	}
 
 	/*
@@ -1783,10 +1813,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	if (!whole_block)
 		goto single_page;
 
-	free_pages = move_freepages_block(zone, page, start_type,
-						&movable_pages);
 	/* moving whole block can fail due to zone boundary conditions */
-	if (!free_pages)
+	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+				       &free_pages, &movable_pages))
 		goto single_page;
 
 	/*
@@ -1804,7 +1833,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		 * vice versa, be conservative since we can't distinguish the
 		 * exact migratetype of non-movable pages.
 		 */
-		if (old_block_type == MIGRATE_MOVABLE)
+		if (block_type == MIGRATE_MOVABLE)
 			alike_pages = pageblock_nr_pages
 						- (free_pages + movable_pages);
 		else
@@ -1815,13 +1844,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	 * compatible migratability as our allocation, claim the whole block.
 	 */
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
-			page_group_by_mobility_disabled)
+			page_group_by_mobility_disabled) {
+		move_freepages(zone, start_pfn, end_pfn, start_type);
 		set_pageblock_migratetype(page, start_type);
-
-	return;
+		block_type = start_type;
+	}
 
 single_page:
-	move_to_free_list(page, zone, current_order, start_type);
+	del_page_from_free_list(page, zone, current_order);
+	expand(zone, page, order, current_order, block_type);
 }
 
 /*
@@ -1885,9 +1916,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	mt = get_pageblock_migratetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
 	if (migratetype_is_mergeable(mt)) {
-		zone->nr_reserved_highatomic += pageblock_nr_pages;
-		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
-		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
+		if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) {
+			set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+			zone->nr_reserved_highatomic += pageblock_nr_pages;
+		}
 	}
 
 out_unlock:
@@ -1912,7 +1944,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 	struct zone *zone;
 	struct page *page;
 	int order;
-	bool ret;
+	int ret;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
 								ac->nodemask) {
@@ -1961,10 +1993,14 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * of pageblocks that cannot be completely freed
 			 * may increase.
 			 */
+			ret = move_freepages_block(zone, page, ac->migratetype);
+			/*
+			 * Reserving this block already succeeded, so this should
+			 * not fail on zone boundaries.
+			 */
+			WARN_ON_ONCE(ret == -1);
 			set_pageblock_migratetype(page, ac->migratetype);
-			ret = move_freepages_block(zone, page, ac->migratetype,
-									NULL);
-			if (ret) {
+			if (ret > 0) {
 				spin_unlock_irqrestore(&zone->lock, flags);
 				return ret;
 			}
@@ -1985,7 +2021,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
  * deviation from the rest of this file, to make the for loop
  * condition simpler.
  */
-static __always_inline bool
+static __always_inline struct page *
 __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 						unsigned int alloc_flags)
 {
@@ -2032,7 +2068,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 		goto do_steal;
 	}
 
-	return false;
+	return NULL;
 
 find_smallest:
 	for (current_order = order; current_order <= MAX_ORDER;
@@ -2053,14 +2089,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 do_steal:
 	page = get_page_from_free_area(area, fallback_mt);
 
-	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
-								can_steal);
+	/* take off list, maybe claim block, expand remainder */
+	steal_suitable_fallback(zone, page, current_order, order,
+				start_migratetype, alloc_flags, can_steal);
 
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
 
-	return true;
-
+	return page;
 }
 
 /*
@@ -2087,15 +2123,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 				return page;
 		}
 	}
-retry:
+
 	page = __rmqueue_smallest(zone, order, migratetype);
 	if (unlikely(!page)) {
 		if (alloc_flags & ALLOC_CMA)
 			page = __rmqueue_cma_fallback(zone, order);
-
-		if (!page && __rmqueue_fallback(zone, order, migratetype,
-								alloc_flags))
-			goto retry;
+		else
+			page = __rmqueue_fallback(zone, order, migratetype,
+						  alloc_flags);
 	}
 	return page;
 }
@@ -2548,12 +2583,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
 			 * Only change normal pageblocks (i.e., they can merge
 			 * with others)
 			 */
-			if (migratetype_is_mergeable(mt)) {
-				set_pageblock_migratetype(page,
-							  MIGRATE_MOVABLE);
-				move_freepages_block(zone, page,
-						     MIGRATE_MOVABLE, NULL);
-			}
+			if (migratetype_is_mergeable(mt) &&
+			    move_freepages_block(zone, page,
+						 MIGRATE_MOVABLE) != -1)
+				set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 		}
 	}
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index bcf99ba747a0..cc48a3a52f00 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,15 +178,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		unsigned long nr_pages;
+		int nr_pages;
 		int mt = get_pageblock_migratetype(page);
 
+		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+		/* Block spans zone boundaries? */
+		if (nr_pages == -1) {
+			spin_unlock_irqrestore(&zone->lock, flags);
+			return -EBUSY;
+		}
+		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
-		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
-									NULL);
-
-		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
 	}
@@ -206,7 +209,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 static void unset_migratetype_isolate(struct page *page, int migratetype)
 {
 	struct zone *zone;
-	unsigned long flags, nr_pages;
+	unsigned long flags;
 	bool isolated_page = false;
 	unsigned int order;
 	struct page *buddy;
@@ -252,7 +255,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 	 * allocation.
 	 */
 	if (!isolated_page) {
-		nr_pages = move_freepages_block(zone, page, migratetype, NULL);
+		int nr_pages = move_freepages_block(zone, page, migratetype);
+		/*
+		 * Isolating this block already succeeded, so this
+		 * should not fail on zone boundaries.
+		 */
+		WARN_ON_ONCE(nr_pages == -1);
 		__mod_zone_freepage_state(zone, nr_pages, migratetype);
 	}
 	set_pageblock_migratetype(page, migratetype);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 6/6] mm: page_alloc: consolidate free page accounting
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
                   ` (4 preceding siblings ...)
  2023-09-11 19:41 ` [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion Johannes Weiner
@ 2023-09-11 19:41 ` Johannes Weiner
  2023-09-13 20:18   ` Vlastimil Babka
  2023-09-14 23:52 ` [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Mike Kravetz
  6 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 19:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

Free page accounting currently happens a bit too high up the call
stack, where it has to deal with guard pages, compaction capturing,
block stealing and even page isolation. This is subtle and fragile,
and makes it difficult to hack on the code.

Now that type violations on the freelists have been fixed, push the
accounting down to where pages enter and leave the freelist.
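
The intended shape of the change can be illustrated with a stand-alone sketch (the struct, enum values, and counter names below are simplified stand-ins for illustration, not the kernel's actual definitions): all freelist add/del/move helpers funnel through one accounting function, so callers higher up no longer touch NR_FREE_PAGES directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the zone vmstat counters (illustration only) */
enum { NR_FREE_PAGES, NR_FREE_CMA_PAGES, NR_COUNTERS };

struct zone { long vm_stat[NR_COUNTERS]; };

/* Hypothetical migratetype values for this sketch */
static bool is_migrate_isolate(int mt) { return mt == 3; }
static bool is_migrate_cma(int mt)     { return mt == 4; }

/*
 * Single accounting choke point, called from the freelist helpers
 * (add_to_free_list(), del_page_from_free_list(), move_to_free_list())
 * at the moment a page enters or leaves a freelist.
 */
static void account_freepages(struct zone *zone, int nr_pages, int migratetype)
{
	if (is_migrate_isolate(migratetype))
		return; /* isolated pages don't count toward watermarks */

	zone->vm_stat[NR_FREE_PAGES] += nr_pages;

	if (is_migrate_cma(migratetype))
		zone->vm_stat[NR_FREE_CMA_PAGES] += nr_pages;
}
```

With accounting at this choke point, guard pages, compaction capture, and block stealing no longer need their own counter adjustments: they simply add or delete pages through the helpers.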

v3:
- fix CONFIG_UNACCEPTED_MEMORY build (lkp)
v2:
- fix CONFIG_DEBUG_PAGEALLOC build (Mel)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mm.h             |  18 ++---
 include/linux/page-isolation.h |   3 +-
 include/linux/vmstat.h         |   8 --
 mm/debug_page_alloc.c          |  12 +--
 mm/internal.h                  |   5 --
 mm/page_alloc.c                | 135 ++++++++++++++++++---------------
 mm/page_isolation.c            |   7 +-
 7 files changed, 90 insertions(+), 98 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..d8698248f280 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3680,24 +3680,22 @@ static inline bool page_is_guard(struct page *page)
 	return PageGuard(page);
 }
 
-bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
-		      int migratetype);
+bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order);
 static inline bool set_page_guard(struct zone *zone, struct page *page,
-				  unsigned int order, int migratetype)
+				  unsigned int order)
 {
 	if (!debug_guardpage_enabled())
 		return false;
-	return __set_page_guard(zone, page, order, migratetype);
+	return __set_page_guard(zone, page, order);
 }
 
-void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
-			int migratetype);
+void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order);
 static inline void clear_page_guard(struct zone *zone, struct page *page,
-				    unsigned int order, int migratetype)
+				    unsigned int order)
 {
 	if (!debug_guardpage_enabled())
 		return;
-	__clear_page_guard(zone, page, order, migratetype);
+	__clear_page_guard(zone, page, order);
 }
 
 #else	/* CONFIG_DEBUG_PAGEALLOC */
@@ -3707,9 +3705,9 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
 static inline bool debug_guardpage_enabled(void) { return false; }
 static inline bool page_is_guard(struct page *page) { return false; }
 static inline bool set_page_guard(struct zone *zone, struct page *page,
-			unsigned int order, int migratetype) { return false; }
+			unsigned int order) { return false; }
 static inline void clear_page_guard(struct zone *zone, struct page *page,
-				unsigned int order, int migratetype) {}
+				unsigned int order) {}
 #endif	/* CONFIG_DEBUG_PAGEALLOC */
 
 #ifdef __HAVE_ARCH_GATE_AREA
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 8550b3c91480..901915747960 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,7 +34,8 @@ static inline bool is_migrate_isolate(int migratetype)
 #define REPORT_FAILURE	0x2
 
 void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
+int move_freepages_block(struct zone *zone, struct page *page,
+			 int old_mt, int new_mt);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fed855bae6d8..a4eae03f6094 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -487,14 +487,6 @@ static inline void node_stat_sub_folio(struct folio *folio,
 	mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
 }
 
-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
-					     int migratetype)
-{
-	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
-	if (is_migrate_cma(migratetype))
-		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
-}
-
 extern const char * const vmstat_text[];
 
 static inline const char *zone_stat_name(enum zone_stat_item item)
diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c
index f9d145730fd1..03a810927d0a 100644
--- a/mm/debug_page_alloc.c
+++ b/mm/debug_page_alloc.c
@@ -32,8 +32,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
 }
 early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
 
-bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
-		      int migratetype)
+bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order)
 {
 	if (order >= debug_guardpage_minorder())
 		return false;
@@ -41,19 +40,12 @@ bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
 	__SetPageGuard(page);
 	INIT_LIST_HEAD(&page->buddy_list);
 	set_page_private(page, order);
-	/* Guard pages are not available for any usage */
-	if (!is_migrate_isolate(migratetype))
-		__mod_zone_freepage_state(zone, -(1 << order), migratetype);
 
 	return true;
 }
 
-void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
-		      int migratetype)
+void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order)
 {
 	__ClearPageGuard(page);
-
 	set_page_private(page, 0);
-	if (!is_migrate_isolate(migratetype))
-		__mod_zone_freepage_state(zone, (1 << order), migratetype);
 }
diff --git a/mm/internal.h b/mm/internal.h
index 30cf724ddbce..d53b70e9cc3a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -883,11 +883,6 @@ static inline bool is_migrate_highatomic(enum migratetype migratetype)
 	return migratetype == MIGRATE_HIGHATOMIC;
 }
 
-static inline bool is_migrate_highatomic_page(struct page *page)
-{
-	return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
-}
-
 void setup_zone_pageset(struct zone *zone);
 
 struct migration_target_control {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a902593f16dd..bfede72251d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -640,24 +640,36 @@ compaction_capture(struct capture_control *capc, struct page *page,
 }
 #endif /* CONFIG_COMPACTION */
 
-/* Used for pages not on another list */
-static inline void add_to_free_list(struct page *page, struct zone *zone,
-				    unsigned int order, int migratetype)
+static inline void account_freepages(struct page *page, struct zone *zone,
+				     int nr_pages, int migratetype)
 {
-	struct free_area *area = &zone->free_area[order];
+	if (is_migrate_isolate(migratetype))
+		return;
 
-	list_add(&page->buddy_list, &area->free_list[migratetype]);
-	area->nr_free++;
+	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+	if (is_migrate_cma(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
 /* Used for pages not on another list */
-static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
-					 unsigned int order, int migratetype)
+static inline void add_to_free_list(struct page *page, struct zone *zone,
+				    unsigned int order, int migratetype,
+				    bool tail)
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
+		     get_pageblock_migratetype(page), migratetype, 1 << order);
+
+	if (tail)
+		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+	else
+		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_freepages(page, zone, 1 << order, migratetype);
 }
 
 /*
@@ -666,16 +678,28 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
  * allocation again (e.g., optimization for memory onlining).
  */
 static inline void move_to_free_list(struct page *page, struct zone *zone,
-				     unsigned int order, int migratetype)
+				     unsigned int order, int old_mt, int new_mt)
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
+	/* Free page moving can fail, so it happens before the type update */
+	VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
+		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
+		     get_pageblock_migratetype(page), old_mt, 1 << order);
+
+	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+
+	account_freepages(page, zone, -(1 << order), old_mt);
+	account_freepages(page, zone, 1 << order, new_mt);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
-					   unsigned int order)
+					   unsigned int order, int migratetype)
 {
+	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
+		     get_pageblock_migratetype(page), migratetype, 1 << order);
+
 	/* clear reported state and update reported page count */
 	if (page_reported(page))
 		__ClearPageReported(page);
@@ -684,6 +708,8 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_freepages(page, zone, -(1 << order), migratetype);
 }
 
 static inline struct page *get_page_from_free_area(struct free_area *area,
@@ -757,23 +783,21 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
 
 	VM_BUG_ON(migratetype == -1);
-	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
-
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
 	while (order < MAX_ORDER) {
-		if (compaction_capture(capc, page, order, migratetype)) {
-			__mod_zone_freepage_state(zone, -(1 << order),
-								migratetype);
+		int buddy_mt;
+
+		if (compaction_capture(capc, page, order, migratetype))
 			return;
-		}
 
 		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
 		if (!buddy)
 			goto done_merging;
 
+		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
+
 		if (unlikely(order >= pageblock_order)) {
 			/*
 			 * We want to prevent merge between freepages on pageblock
@@ -801,9 +825,9 @@ static inline void __free_one_page(struct page *page,
 		 * merge with it and move up one order.
 		 */
 		if (page_is_guard(buddy))
-			clear_page_guard(zone, buddy, order, migratetype);
+			clear_page_guard(zone, buddy, order);
 		else
-			del_page_from_free_list(buddy, zone, order);
+			del_page_from_free_list(buddy, zone, order, buddy_mt);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -820,10 +844,7 @@ static inline void __free_one_page(struct page *page,
 	else
 		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
-	if (to_tail)
-		add_to_free_list_tail(page, zone, order, migratetype);
-	else
-		add_to_free_list(page, zone, order, migratetype);
+	add_to_free_list(page, zone, order, migratetype, to_tail);
 
 	/* Notify page reporting subsystem of freed page */
 	if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
@@ -865,10 +886,8 @@ int split_free_page(struct page *free_page,
 	}
 
 	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
-	if (likely(!is_migrate_isolate(mt)))
-		__mod_zone_freepage_state(zone, -(1UL << order), mt);
+	del_page_from_free_list(free_page, zone, order, mt);
 
-	del_page_from_free_list(free_page, zone, order);
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
@@ -1388,10 +1407,10 @@ static inline void expand(struct zone *zone, struct page *page,
 		 * Corresponding page table entries will not be touched,
 		 * pages will stay not present in virtual address space
 		 */
-		if (set_page_guard(zone, &page[size], high, migratetype))
+		if (set_page_guard(zone, &page[size], high))
 			continue;
 
-		add_to_free_list(&page[size], zone, high, migratetype);
+		add_to_free_list(&page[size], zone, high, migratetype, false);
 		set_buddy_order(&page[size], high);
 	}
 }
@@ -1561,7 +1580,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
-		del_page_from_free_list(page, zone, current_order);
+		del_page_from_free_list(page, zone, current_order, migratetype);
 		expand(zone, page, order, current_order, migratetype);
 		trace_mm_page_alloc_zone_locked(page, order, migratetype,
 				pcp_allowed_order(order) &&
@@ -1602,7 +1621,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
  * boundary. If alignment is required, use move_freepages_block()
  */
 static int move_freepages(struct zone *zone, unsigned long start_pfn,
-			  unsigned long end_pfn, int migratetype)
+			  unsigned long end_pfn, int old_mt, int new_mt)
 {
 	struct page *page;
 	unsigned long pfn;
@@ -1621,7 +1640,7 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = buddy_order(page);
-		move_to_free_list(page, zone, order, migratetype);
+		move_to_free_list(page, zone, order, old_mt, new_mt);
 		pfn += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -1676,7 +1695,7 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 }
 
 int move_freepages_block(struct zone *zone, struct page *page,
-			 int migratetype)
+			 int old_mt, int new_mt)
 {
 	unsigned long start_pfn, end_pfn;
 
@@ -1684,7 +1703,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
 				       NULL, NULL))
 		return -1;
 
-	return move_freepages(zone, start_pfn, end_pfn, migratetype);
+	return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt);
 }
 
 static void change_pageblock_range(struct page *pageblock_page,
@@ -1795,7 +1814,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
-		del_page_from_free_list(page, zone, current_order);
+		del_page_from_free_list(page, zone, current_order, block_type);
 		change_pageblock_range(page, current_order, start_type);
 		expand(zone, page, order, current_order, start_type);
 		return;
@@ -1845,13 +1864,13 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	 */
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
-		move_freepages(zone, start_pfn, end_pfn, start_type);
+		move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
 		set_pageblock_migratetype(page, start_type);
 		block_type = start_type;
 	}
 
 single_page:
-	del_page_from_free_list(page, zone, current_order);
+	del_page_from_free_list(page, zone, current_order, block_type);
 	expand(zone, page, order, current_order, block_type);
 }
 
@@ -1916,7 +1935,8 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	mt = get_pageblock_migratetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
 	if (migratetype_is_mergeable(mt)) {
-		if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) {
+		if (move_freepages_block(zone, page,
+					 mt, MIGRATE_HIGHATOMIC) != -1) {
 			set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 			zone->nr_reserved_highatomic += pageblock_nr_pages;
 		}
@@ -1959,11 +1979,13 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order <= MAX_ORDER; order++) {
 			struct free_area *area = &(zone->free_area[order]);
+			int mt;
 
 			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
 			if (!page)
 				continue;
 
+			mt = get_pageblock_migratetype(page);
 			/*
 			 * In page freeing path, migratetype change is racy so
 			 * we can counter several free pages in a pageblock
@@ -1971,7 +1993,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * from highatomic to ac->migratetype. So we should
 			 * adjust the count once.
 			 */
-			if (is_migrate_highatomic_page(page)) {
+			if (is_migrate_highatomic(mt)) {
 				/*
 				 * It should never happen but changes to
 				 * locking could inadvertently allow a per-cpu
@@ -1993,7 +2015,8 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * of pageblocks that cannot be completely freed
 			 * may increase.
 			 */
-			ret = move_freepages_block(zone, page, ac->migratetype);
+			ret = move_freepages_block(zone, page, mt,
+						   ac->migratetype);
 			/*
 			 * Reserving this block already succeeded, so this should
 			 * not fail on zone boundaries.
@@ -2165,12 +2188,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
-		if (is_migrate_cma(get_pageblock_migratetype(page)))
-			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-					      -(1 << order));
 	}
-
-	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	return i;
@@ -2565,11 +2583,9 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
 		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
 			return 0;
-
-		__mod_zone_freepage_state(zone, -(1UL << order), mt);
 	}
 
-	del_page_from_free_list(page, zone, order);
+	del_page_from_free_list(page, zone, order, mt);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -2584,7 +2600,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 			 * with others)
 			 */
 			if (migratetype_is_mergeable(mt) &&
-			    move_freepages_block(zone, page,
+			    move_freepages_block(zone, page, mt,
 						 MIGRATE_MOVABLE) != -1)
 				set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 		}
@@ -2670,8 +2686,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 				return NULL;
 			}
 		}
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pageblock_migratetype(page));
 		spin_unlock_irqrestore(&zone->lock, flags);
 	} while (check_new_pages(page, order));
 
@@ -6434,8 +6448,9 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
+		VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
 		order = buddy_order(page);
-		del_page_from_free_list(page, zone, order);
+		del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -6486,11 +6501,12 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
 			current_buddy = page + size;
 		}
 
-		if (set_page_guard(zone, current_buddy, high, migratetype))
+		if (set_page_guard(zone, current_buddy, high))
 			continue;
 
 		if (current_buddy != target) {
-			add_to_free_list(current_buddy, zone, high, migratetype);
+			add_to_free_list(current_buddy, zone, high,
+					 migratetype, false);
 			set_buddy_order(current_buddy, high);
 			page = next_page;
 		}
@@ -6518,12 +6534,11 @@ bool take_page_off_buddy(struct page *page)
 			int migratetype = get_pfnblock_migratetype(page_head,
 								   pfn_head);
 
-			del_page_from_free_list(page_head, zone, page_order);
+			del_page_from_free_list(page_head, zone, page_order,
+						migratetype);
 			break_down_buddy_pages(zone, page_head, page, 0,
 						page_order, migratetype);
 			SetPageHWPoisonTakenOff(page);
-			if (!is_migrate_isolate(migratetype))
-				__mod_zone_freepage_state(zone, -1, migratetype);
 			ret = true;
 			break;
 		}
@@ -6630,7 +6645,7 @@ static bool try_to_accept_memory_one(struct zone *zone)
 	list_del(&page->lru);
 	last = list_empty(&zone->unaccepted_pages);
 
-	__mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+	account_freepages(page, zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
 	__mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -6682,7 +6697,7 @@ static bool __free_unaccepted(struct page *page)
 	spin_lock_irqsave(&zone->lock, flags);
 	first = list_empty(&zone->unaccepted_pages);
 	list_add_tail(&page->lru, &zone->unaccepted_pages);
-	__mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+	account_freepages(page, zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
 	__mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index cc48a3a52f00..b5c7a9d21257 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -181,13 +181,12 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 		int nr_pages;
 		int mt = get_pageblock_migratetype(page);
 
-		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+		nr_pages = move_freepages_block(zone, page, mt, MIGRATE_ISOLATE);
 		/* Block spans zone boundaries? */
 		if (nr_pages == -1) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
-		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
@@ -255,13 +254,13 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 	 * allocation.
 	 */
 	if (!isolated_page) {
-		int nr_pages = move_freepages_block(zone, page, migratetype);
+		int nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
+						    migratetype);
 		/*
 		 * Isolating this block already succeeded, so this
 		 * should not fail on zone boundaries.
 		 */
 		WARN_ON_ONCE(nr_pages == -1);
-		__mod_zone_freepage_state(zone, nr_pages, migratetype);
 	}
 	set_pageblock_migratetype(page, migratetype);
 	if (isolated_page)
-- 
2.42.0



* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
@ 2023-09-11 19:59   ` Zi Yan
  2023-09-11 21:09     ` Andrew Morton
  2023-09-12 13:47   ` Vlastimil Babka
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-11 19:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

On 11 Sep 2023, at 15:41, Johannes Weiner wrote:

> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
>
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), whis is likely beneficial to long-term

s/whis/this

> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 61 ++++++++++++-------------------------------------
>  1 file changed, 14 insertions(+), 47 deletions(-)
>

Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]


* Re: [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks
  2023-09-11 19:41 ` [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks Johannes Weiner
@ 2023-09-11 20:01   ` Zi Yan
  2023-09-13  9:52   ` Vlastimil Babka
  2023-09-14 10:00   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-09-11 20:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel


On 11 Sep 2023, at 15:41, Johannes Weiner wrote:

> The buddy allocator coalesces compatible blocks during freeing, but it
> doesn't update the types of the subblocks to match. When an allocation
> later breaks the chunk down again, its pieces will be put on freelists
> of the wrong type. This encourages incompatible page mixing (ask for
> one type, get another), and thus long-term fragmentation.
>
> Update the subblocks when merging a larger chunk, such that a later
> expand() will maintain freelist type hygiene.
>
> v2:
> - remove spurious change_pageblock_range() move (Zi Yan)
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


* Re: [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation
  2023-09-11 19:41 ` [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation Johannes Weiner
@ 2023-09-11 20:17   ` Zi Yan
  2023-09-11 20:47     ` Johannes Weiner
  2023-09-13 14:31   ` Vlastimil Babka
  2023-09-14 10:03   ` Mel Gorman
  2 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-11 20:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On 11 Sep 2023, at 15:41, Johannes Weiner wrote:

> When claiming a block during compaction isolation, move any remaining
> free pages to the correct freelists as well, instead of stranding them
> on the wrong list. Otherwise, this encourages incompatible page mixing
> down the line, and thus long-term fragmentation.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3db405414174..f6f658c3d394 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2548,9 +2548,12 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  			 * Only change normal pageblocks (i.e., they can merge
>  			 * with others)
>  			 */
> -			if (migratetype_is_mergeable(mt))
> +			if (migratetype_is_mergeable(mt)) {
>  				set_pageblock_migratetype(page,
>  							  MIGRATE_MOVABLE);
> +				move_freepages_block(zone, page,
> +						     MIGRATE_MOVABLE, NULL);
> +			}
>  		}
>  	}
>
> -- 
> 2.42.0

Is this needed? And is this correct?

__isolate_free_page() removes the free page from a free list, but the added
move_freepages_block() puts the page back to another free list, making
__isolate_free_page() not do its work. OK, the for loop is going through
the pages within the pageblock, so move_freepages_block() should be used
on the rest of the pages on the pageblock.

So to make this correct, the easiest change might be to move
del_page_from_free_list(page, zone, order) below this code chunk.

--
Best Regards,
Yan, Zi


* Re: [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error
  2023-09-11 19:41 ` [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error Johannes Weiner
@ 2023-09-11 20:23   ` Zi Yan
  2023-09-13 14:40   ` Vlastimil Babka
  2023-09-14 10:03   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-09-11 20:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On 11 Sep 2023, at 15:41, Johannes Weiner wrote:

> When a block is partially outside the zone of the cursor page, the
> function cuts the range to the pivot page instead of the zone
> start. This can leave large parts of the block behind, which
> encourages incompatible page mixing down the line (ask for one type,
> get another), and thus long-term fragmentation.
>
> This triggers reliably on the first block in the DMA zone, whose
> start_pfn is 1. The block is stolen, but everything before the pivot
> page (which was often hundreds of pages) is left on the old list.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>


--
Best Regards,
Yan, Zi


* Re: [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation
  2023-09-11 20:17   ` Zi Yan
@ 2023-09-11 20:47     ` Johannes Weiner
  2023-09-11 20:50       ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-11 20:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On Mon, Sep 11, 2023 at 04:17:07PM -0400, Zi Yan wrote:
> On 11 Sep 2023, at 15:41, Johannes Weiner wrote:
> 
> > When claiming a block during compaction isolation, move any remaining
> > free pages to the correct freelists as well, instead of stranding them
> > on the wrong list. Otherwise, this encourages incompatible page mixing
> > down the line, and thus long-term fragmentation.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/page_alloc.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3db405414174..f6f658c3d394 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2548,9 +2548,12 @@ int __isolate_free_page(struct page *page, unsigned int order)
> >  			 * Only change normal pageblocks (i.e., they can merge
> >  			 * with others)
> >  			 */
> > -			if (migratetype_is_mergeable(mt))
> > +			if (migratetype_is_mergeable(mt)) {
> >  				set_pageblock_migratetype(page,
> >  							  MIGRATE_MOVABLE);
> > +				move_freepages_block(zone, page,
> > +						     MIGRATE_MOVABLE, NULL);
> > +			}
> >  		}
> >  	}
> >
> > -- 
> > 2.42.0
> 
> Is this needed?

Yes, the problem is if we e.g. isolate half a block, then we'll
convert the type of the whole block but strand the half we're not
isolating. This can be a couple of hundred pages on the wrong list.

> And is this correct?
> 
> __isolate_free_page() removes the free page from a free list, but the added
> move_freepages_block() puts the page back to another free list, making
> __isolate_free_page() not do its work. OK, the for loop is going through
> the pages within the pageblock, so move_freepages_block() should be used
> on the rest of the pages on the pageblock.
> 
> So to make this correct, the easiest change might be to move
> del_page_from_free_list(page, zone, order) below this code chunk.

There is a del_page_from_free_list() just above this diff hunk. That
takes the page off the list and clears its PageBuddy.

move_freepages_block() will then move only the remainder of the block
that's still on the freelist with a mismatched type (move_freepages()
only moves buddies).


* Re: [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation
  2023-09-11 20:47     ` Johannes Weiner
@ 2023-09-11 20:50       ` Zi Yan
  0 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-09-11 20:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2463 bytes --]

On 11 Sep 2023, at 16:47, Johannes Weiner wrote:

> On Mon, Sep 11, 2023 at 04:17:07PM -0400, Zi Yan wrote:
>> On 11 Sep 2023, at 15:41, Johannes Weiner wrote:
>>
>>> When claiming a block during compaction isolation, move any remaining
>>> free pages to the correct freelists as well, instead of stranding them
>>> on the wrong list. Otherwise, this encourages incompatible page mixing
>>> down the line, and thus long-term fragmentation.
>>>
>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>> ---
>>>  mm/page_alloc.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3db405414174..f6f658c3d394 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2548,9 +2548,12 @@ int __isolate_free_page(struct page *page, unsigned int order)
>>>  			 * Only change normal pageblocks (i.e., they can merge
>>>  			 * with others)
>>>  			 */
>>> -			if (migratetype_is_mergeable(mt))
>>> +			if (migratetype_is_mergeable(mt)) {
>>>  				set_pageblock_migratetype(page,
>>>  							  MIGRATE_MOVABLE);
>>> +				move_freepages_block(zone, page,
>>> +						     MIGRATE_MOVABLE, NULL);
>>> +			}
>>>  		}
>>>  	}
>>>
>>> -- 
>>> 2.42.0
>>
>> Is this needed?
>
> Yes, the problem is if we e.g. isolate half a block, then we'll
> convert the type of the whole block but strand the half we're not
> isolating. This can be a couple of hundred pages on the wrong list.
>
>> And is this correct?
>>
>> __isolate_free_page() removes the free page from a free list, but the added
>> move_freepages_block() puts the page back to another free list, making
>> __isolate_free_page() not do its work. OK, the for loop is going through
>> the pages within the pageblock, so move_freepages_block() should be used
>> on the rest of the pages on the pageblock.
>>
>> So to make this correct, the easiest change might be to move
>> del_page_from_free_list(page, zone, order) below this code chunk.
>
> There is a del_page_from_free_list() just above this diff hunk. That
> takes the page off the list and clears its PageBuddy.
>
> move_freepages_block() will then move only the remainder of the block
> that's still on the freelist with a mismatched type (move_freepages()
> only moves buddies).

Ah, I missed __ClearPageBuddy() in del_page_from_free_list(). Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>


--
Best Regards,
Yan, Zi


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-11 19:59   ` Zi Yan
@ 2023-09-11 21:09     ` Andrew Morton
  0 siblings, 0 replies; 83+ messages in thread
From: Andrew Morton @ 2023-09-11 21:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: Johannes Weiner, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On Mon, 11 Sep 2023 15:59:43 -0400 Zi Yan <ziy@nvidia.com> wrote:

> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), whis is likely beneficial to long-term
> 
> s/whis/this

Thanks, I did s/whis/which/


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
  2023-09-11 19:59   ` Zi Yan
@ 2023-09-12 13:47   ` Vlastimil Babka
  2023-09-12 14:50     ` Johannes Weiner
  2023-09-12 15:03     ` Johannes Weiner
  2023-09-14  9:56   ` Mel Gorman
  2023-09-27  5:42   ` Huang, Ying
  3 siblings, 2 replies; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-12 13:47 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 9/11/23 21:41, Johannes Weiner wrote:
> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
> 
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> 
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> 
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> 
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> 
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> 
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), whis is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.

Makes sense to me, so

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Some notes below.

> @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  			continue;
>  		del_page_from_free_list(page, zone, current_order);
>  		expand(zone, page, order, current_order, migratetype);
> -		set_pcppage_migratetype(page, migratetype);

Hm interesting, just noticed that __rmqueue_fallback() never did this
AFAICS, sounds like a bug.

>  		trace_mm_page_alloc_zone_locked(page, order, migratetype,
>  				pcp_allowed_order(order) &&
>  				migratetype < MIGRATE_PCPTYPES);
> @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> +		if (is_migrate_cma(get_pageblock_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));

This is potentially a source of overhead, I assume patch 6/6 might
change that.

>  	}
> @@ -2304,19 +2282,6 @@ void drain_all_pages(struct zone *zone)
>  	__drain_all_pages(zone, false);
>  }
>  
> -static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
> -							unsigned int order)
> -{
> -	int migratetype;
> -
> -	if (!free_pages_prepare(page, order, FPI_NONE))
> -		return false;
> -
> -	migratetype = get_pfnblock_migratetype(page, pfn);
> -	set_pcppage_migratetype(page, migratetype);
> -	return true;
> -}
> -
>  static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
>  {
>  	int min_nr_free, max_nr_free;
> @@ -2402,7 +2367,7 @@ void free_unref_page(struct page *page, unsigned int order)
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype, pcpmigratetype;
>  
> -	if (!free_unref_page_prepare(page, pfn, order))
> +	if (!free_pages_prepare(page, order, FPI_NONE))
>  		return;
>  
>  	/*
> @@ -2412,7 +2377,7 @@ void free_unref_page(struct page *page, unsigned int order)
>  	 * get those areas back if necessary. Otherwise, we may have to free
>  	 * excessively into the page allocator
>  	 */
> -	migratetype = pcpmigratetype = get_pcppage_migratetype(page);
> +	migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
>  	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
>  		if (unlikely(is_migrate_isolate(migratetype))) {
>  			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
> @@ -2448,7 +2413,8 @@ void free_unref_page_list(struct list_head *list)
>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		unsigned long pfn = page_to_pfn(page);
> -		if (!free_unref_page_prepare(page, pfn, 0)) {
> +
> +		if (!free_pages_prepare(page, 0, FPI_NONE)) {
>  			list_del(&page->lru);
>  			continue;
>  		}
> @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
>  		 * Free isolated pages directly to the allocator, see
>  		 * comment in free_unref_page.
>  		 */
> -		migratetype = get_pcppage_migratetype(page);
> +		migratetype = get_pfnblock_migratetype(page, pfn);
>  		if (unlikely(is_migrate_isolate(migratetype))) {
>  			list_del(&page->lru);
>  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);

I think after this change we should move the isolated pages handling to
the second loop below, so that we wouldn't have to call
get_pfnblock_migratetype() twice per page. Dunno yet if some later patch
does that. It would need to unlock pcp when necessary.

> @@ -2466,10 +2432,11 @@ void free_unref_page_list(struct list_head *list)
>  	}
>  
>  	list_for_each_entry_safe(page, next, list, lru) {
> +		unsigned long pfn = page_to_pfn(page);
>  		struct zone *zone = page_zone(page);
>  
>  		list_del(&page->lru);
> -		migratetype = get_pcppage_migratetype(page);
> +		migratetype = get_pfnblock_migratetype(page, pfn);
>  
>  		/*
>  		 * Either different zone requiring a different pcp lock or
> @@ -2492,7 +2459,7 @@ void free_unref_page_list(struct list_head *list)
>  			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  			if (unlikely(!pcp)) {
>  				pcp_trylock_finish(UP_flags);
> -				free_one_page(zone, page, page_to_pfn(page),
> +				free_one_page(zone, page, pfn,
>  					      0, migratetype, FPI_NONE);
>  				locked_zone = NULL;
>  				continue;
> @@ -2661,7 +2628,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			}
>  		}
>  		__mod_zone_freepage_state(zone, -(1 << order),
> -					  get_pcppage_migratetype(page));
> +					  get_pageblock_migratetype(page));
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	} while (check_new_pages(page, order));
>  


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-12 13:47   ` Vlastimil Babka
@ 2023-09-12 14:50     ` Johannes Weiner
  2023-09-13  9:33       ` Vlastimil Babka
  2023-09-12 15:03     ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-12 14:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
> On 9/11/23 21:41, Johannes Weiner wrote:
> > The idea behind the cache is to save get_pageblock_migratetype()
> > lookups during bulk freeing. A microbenchmark suggests this isn't
> > helping, though. The pcp migratetype can get stale, which means that
> > bulk freeing has an extra branch to check if the pageblock was
> > isolated while on the pcp.
> > 
> > While the variance overlaps, the cache write and the branch seem to
> > make this a net negative. The following test allocates and frees
> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> > 
> > Before:
> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> > 
> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> > 
> > After:
> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> > 
> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> > 
> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), whis is likely beneficial to long-term
> > fragmentation management, although the effects would be harder to
> > measure. Settle for simpler and faster code as justification here.
> 
> Makes sense to me, so
> 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  			continue;
> >  		del_page_from_free_list(page, zone, current_order);
> >  		expand(zone, page, order, current_order, migratetype);
> > -		set_pcppage_migratetype(page, migratetype);
> 
> Hm interesting, just noticed that __rmqueue_fallback() never did this
> AFAICS, sounds like a bug.

I don't quite follow. Which part?

Keep in mind that at this point __rmqueue_fallback() doesn't return a
page. It just moves pages to the desired freelist, and then
__rmqueue_smallest() gets called again. This changes in 5/6, but until
now at least all of the above would apply to fallback pages.

> > @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >  		 * pages are ordered properly.
> >  		 */
> >  		list_add_tail(&page->pcp_list, list);
> > -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> > +		if (is_migrate_cma(get_pageblock_migratetype(page)))
> >  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> >  					      -(1 << order));
> 
> This is potentially a source of overhead, I assume patch 6/6 might
> change that.

Yes, 6/6 removes it altogether.

But the test results in this patch's changelog are from this patch in
isolation, so it doesn't appear to be a concern even on its own.

> > @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
> >  		 * Free isolated pages directly to the allocator, see
> >  		 * comment in free_unref_page.
> >  		 */
> > -		migratetype = get_pcppage_migratetype(page);
> > +		migratetype = get_pfnblock_migratetype(page, pfn);
> >  		if (unlikely(is_migrate_isolate(migratetype))) {
> >  			list_del(&page->lru);
> >  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
> 
> I think after this change we should move the isolated pages handling to
> the second loop below, so that we wouldn't have to call
> get_pfnblock_migratetype() twice per page. Dunno yet if some later patch
> does that. It would need to unlock pcp when necessary.

That sounds like a great idea. Something like the following?

Lightly tested. If you're good with it, I'll beat some more on it and
submit it as a follow-up.

---

From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 12 Sep 2023 10:16:10 -0400
Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()

Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.

Minor comment and local variable scoping cleanups.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 49 +++++++++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 28 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..9cad31de1bf5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2408,48 +2408,41 @@ void free_unref_page_list(struct list_head *list)
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
-	int migratetype;
-
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
 
-		if (!free_pages_prepare(page, 0, FPI_NONE)) {
+	list_for_each_entry_safe(page, next, list, lru)
+		if (!free_pages_prepare(page, 0, FPI_NONE))
 			list_del(&page->lru);
-			continue;
-		}
-
-		/*
-		 * Free isolated pages directly to the allocator, see
-		 * comment in free_unref_page.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
-			continue;
-		}
-	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
+		int migratetype;
 
 		list_del(&page->lru);
 		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * Zone switch, batch complete, or non-pcp freeing?
+		 * Drop the pcp lock and evaluate.
 		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		if (unlikely(zone != locked_zone ||
+			     batch_count == SWAP_CLUSTER_MAX ||
+			     is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
+				locked_zone = NULL;
 			}
 
-			batch_count = 0;
+			/*
+			 * Free isolated pages directly to the
+			 * allocator, see comment in free_unref_page.
+			 */
+			if (is_migrate_isolate(migratetype)) {
+				free_one_page(zone, page, pfn, 0,
+					      migratetype, FPI_NONE);
+				continue;
+			}
 
 			/*
 			 * trylock is necessary as pages may be getting freed
@@ -2459,12 +2452,12 @@ void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, pfn,
-					      0, migratetype, FPI_NONE);
-				locked_zone = NULL;
+				free_one_page(zone, page, pfn, 0,
+					      migratetype, FPI_NONE);
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 
 		/*
-- 
2.42.0



* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-12 13:47   ` Vlastimil Babka
  2023-09-12 14:50     ` Johannes Weiner
@ 2023-09-12 15:03     ` Johannes Weiner
  2023-09-14  7:29       ` Vlastimil Babka
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-12 15:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
> I think after this change we should [...]

Speaking of follow-ups, AFAICS we no longer need those either:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cad31de1bf5..bea499fbca58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1751,13 +1751,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 	old_block_type = get_pageblock_migratetype(page);
 
-	/*
-	 * This can happen due to races and we want to prevent broken
-	 * highatomic accounting.
-	 */
-	if (is_migrate_highatomic(old_block_type))
-		goto single_page;
-
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
 		change_pageblock_range(page, current_order, start_type);
@@ -1926,24 +1919,15 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				continue;
 
 			/*
-			 * In page freeing path, migratetype change is racy so
-			 * we can counter several free pages in a pageblock
-			 * in this loop although we changed the pageblock type
-			 * from highatomic to ac->migratetype. So we should
-			 * adjust the count once.
+			 * It should never happen but changes to
+			 * locking could inadvertently allow a per-cpu
+			 * drain to add pages to MIGRATE_HIGHATOMIC
+			 * while unreserving so be safe and watch for
+			 * underflows.
 			 */
-			if (is_migrate_highatomic_page(page)) {
-				/*
-				 * It should never happen but changes to
-				 * locking could inadvertently allow a per-cpu
-				 * drain to add pages to MIGRATE_HIGHATOMIC
-				 * while unreserving so be safe and watch for
-				 * underflows.
-				 */
-				zone->nr_reserved_highatomic -= min(
-						pageblock_nr_pages,
-						zone->nr_reserved_highatomic);
-			}
+			zone->nr_reserved_highatomic -= min(
+				pageblock_nr_pages,
+				zone->nr_reserved_highatomic);
 
 			/*
 			 * Convert to ac->migratetype and avoid the normal

I think they were only in place because we could change the highatomic
status of pages on the pcplist, and those pages would then end up on
some other freelist due to the stale pcppage cache.

I replaced them locally with WARNs and ran an hour or so of kernel
builds under pressure. It didn't trigger. So I would send a follow-up
to remove them.

Unless you point me to a good reason why they're definitely still
needed - in which case this is a moot proposal - but then we should
make the comments more specific.


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-12 14:50     ` Johannes Weiner
@ 2023-09-13  9:33       ` Vlastimil Babka
  2023-09-13 13:24         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13  9:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On 9/12/23 16:50, Johannes Weiner wrote:
> On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
>> On 9/11/23 21:41, Johannes Weiner wrote:
> 
>> > @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>> >  			continue;
>> >  		del_page_from_free_list(page, zone, current_order);
>> >  		expand(zone, page, order, current_order, migratetype);
>> > -		set_pcppage_migratetype(page, migratetype);
>> 
>> Hm interesting, just noticed that __rmqueue_fallback() never did this
>> AFAICS, sounds like a bug.
> 
> I don't quite follow. Which part?
> 
> Keep in mind that at this point __rmqueue_fallback() doesn't return a
> page. It just moves pages to the desired freelist, and then
> __rmqueue_smallest() gets called again. This changes in 5/6, but until
> now at least all of the above would apply to fallback pages.

Yep, missed that "doesn't return a page", thanks.

>> > @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>> >  		 * pages are ordered properly.
>> >  		 */
>> >  		list_add_tail(&page->pcp_list, list);
>> > -		if (is_migrate_cma(get_pcppage_migratetype(page)))
>> > +		if (is_migrate_cma(get_pageblock_migratetype(page)))
>> >  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>> >  					      -(1 << order));
>> 
>> This is potentially a source of overhead, I assume patch 6/6 might
>> change that.
> 
> Yes, 6/6 removes it altogether.
> 
> But the test results in this patch's changelog are from this patch in
> isolation, so it doesn't appear to be a concern even on its own.
> 
>> > @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
>> >  		 * Free isolated pages directly to the allocator, see
>> >  		 * comment in free_unref_page.
>> >  		 */
>> > -		migratetype = get_pcppage_migratetype(page);
>> > +		migratetype = get_pfnblock_migratetype(page, pfn);
>> >  		if (unlikely(is_migrate_isolate(migratetype))) {
>> >  			list_del(&page->lru);
>> >  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
>> 
>> I think after this change we should move the isolated pages handling to
>> the second loop below, so that we wouldn't have to call
>> get_pfnblock_migratetype() twice per page. Dunno yet if some later patch
>> does that. It would need to unlock pcp when necessary.
> 
> That sounds like a great idea. Something like the following?
> 
> Lightly tested. If you're good with it, I'll beat some more on it and
> submit it as a follow-up.
> 
> ---
> 
> From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 12 Sep 2023 10:16:10 -0400
> Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()
> 
> Move direct freeing of isolated pages to the lock-breaking block in
> the second loop. This saves an unnecessary migratetype reassessment.
> 
> Minor comment and local variable scoping cleanups.

Looks like batch_count and locked_zone could be moved to the loop scope as well.

> 
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 49 +++++++++++++++++++++----------------------------
>  1 file changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e3f1c777feed..9cad31de1bf5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2408,48 +2408,41 @@ void free_unref_page_list(struct list_head *list)
>  	struct per_cpu_pages *pcp = NULL;
>  	struct zone *locked_zone = NULL;
>  	int batch_count = 0;
> -	int migratetype;
> -
> -	/* Prepare pages for freeing */
> -	list_for_each_entry_safe(page, next, list, lru) {
> -		unsigned long pfn = page_to_pfn(page);
>  
> -		if (!free_pages_prepare(page, 0, FPI_NONE)) {
> +	list_for_each_entry_safe(page, next, list, lru)
> +		if (!free_pages_prepare(page, 0, FPI_NONE))
>  			list_del(&page->lru);
> -			continue;
> -		}
> -
> -		/*
> -		 * Free isolated pages directly to the allocator, see
> -		 * comment in free_unref_page.
> -		 */
> -		migratetype = get_pfnblock_migratetype(page, pfn);
> -		if (unlikely(is_migrate_isolate(migratetype))) {
> -			list_del(&page->lru);
> -			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
> -			continue;
> -		}
> -	}
>  
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		unsigned long pfn = page_to_pfn(page);
>  		struct zone *zone = page_zone(page);
> +		int migratetype;
>  
>  		list_del(&page->lru);
>  		migratetype = get_pfnblock_migratetype(page, pfn);
>  
>  		/*
> -		 * Either different zone requiring a different pcp lock or
> -		 * excessive lock hold times when freeing a large list of
> -		 * pages.
> +		 * Zone switch, batch complete, or non-pcp freeing?
> +		 * Drop the pcp lock and evaluate.
>  		 */
> -		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
> +		if (unlikely(zone != locked_zone ||
> +			     batch_count == SWAP_CLUSTER_MAX ||
> +			     is_migrate_isolate(migratetype))) {
>  			if (pcp) {
>  				pcp_spin_unlock(pcp);
>  				pcp_trylock_finish(UP_flags);
> +				locked_zone = NULL;
>  			}
>  
> -			batch_count = 0;
> +			/*
> +			 * Free isolated pages directly to the
> +			 * allocator, see comment in free_unref_page.
> +			 */
> +			if (is_migrate_isolate(migratetype)) {
> +				free_one_page(zone, page, pfn, 0,
> +					      migratetype, FPI_NONE);
> +				continue;
> +			}
>  
>  			/*
>  			 * trylock is necessary as pages may be getting freed
> @@ -2459,12 +2452,12 @@ void free_unref_page_list(struct list_head *list)
>  			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  			if (unlikely(!pcp)) {
>  				pcp_trylock_finish(UP_flags);
> -				free_one_page(zone, page, pfn,
> -					      0, migratetype, FPI_NONE);
> -				locked_zone = NULL;
> +				free_one_page(zone, page, pfn, 0,
> +					      migratetype, FPI_NONE);
>  				continue;
>  			}
>  			locked_zone = zone;
> +			batch_count = 0;
>  		}
>  
>  		/*


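The control flow of the restructured loop in the patch above can be modeled in isolation: the pcp lock is dropped on a zone switch, when the batch limit is hit, or for an isolated page, which bypasses the pcp cache entirely. A toy sketch (illustrative names, not kernel API; BATCH_MAX stands in for SWAP_CLUSTER_MAX):

```c
#include <assert.h>

#define BATCH_MAX 4	/* stands in for SWAP_CLUSTER_MAX */

struct item { int zone; int isolated; };

/* Returns the number of lock acquisitions needed to drain the list,
 * following the pattern in the patch: drop the per-zone lock on a zone
 * switch, a full batch, or an isolated item (freed directly). */
static int drain_list(const struct item *items, int n, int *direct_freed)
{
	int locks = 0, batch = 0, locked_zone = -1;

	*direct_freed = 0;
	for (int i = 0; i < n; i++) {
		if (items[i].zone != locked_zone || batch == BATCH_MAX ||
		    items[i].isolated) {
			locked_zone = -1;	/* unlock, if held */
			if (items[i].isolated) {
				(*direct_freed)++;	/* free_one_page() path */
				continue;
			}
			locks++;		/* re-take lock for this zone */
			locked_zone = items[i].zone;
			batch = 0;		/* reset under the new lock */
		}
		batch++;
	}
	return locks;
}
```

Note how resetting batch_count only after a successful lock acquisition, as in the patch, keeps the batch accounting tied to actual lock hold time.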

* Re: [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks
  2023-09-11 19:41 ` [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks Johannes Weiner
  2023-09-11 20:01   ` Zi Yan
@ 2023-09-13  9:52   ` Vlastimil Babka
  2023-09-14 10:00   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13  9:52 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 9/11/23 21:41, Johannes Weiner wrote:
> The buddy allocator coalesces compatible blocks during freeing, but it
> doesn't update the types of the subblocks to match. When an allocation
> later breaks the chunk down again, its pieces will be put on freelists
> of the wrong type. This encourages incompatible page mixing (ask for
> one type, get another), and thus long-term fragmentation.

Yeah why not. Should be pretty rare as this only affects >=pageblock_order,
but then also the overhead in the otherwise hot function is limited to its
colder part.

> Update the subblocks when merging a larger chunk, such that a later
> expand() will maintain freelist type hygiene.
> 
> v2:
> - remove spurious change_pageblock_range() move (Zi Yan)
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e3f1c777feed..3db405414174 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -783,10 +783,17 @@ static inline void __free_one_page(struct page *page,
>  			 */
>  			int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>  
> -			if (migratetype != buddy_mt
> -					&& (!migratetype_is_mergeable(migratetype) ||
> -						!migratetype_is_mergeable(buddy_mt)))
> -				goto done_merging;
> +			if (migratetype != buddy_mt) {
> +				if (!migratetype_is_mergeable(migratetype) ||
> +				    !migratetype_is_mergeable(buddy_mt))
> +					goto done_merging;
> +				/*
> +				 * Match buddy type. This ensures that
> +				 * an expand() down the line puts the
> +				 * sub-blocks on the right freelists.
> +				 */
> +				set_pageblock_migratetype(buddy, migratetype);
> +			}
>  		}
>  
>  		/*


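As background for the merge loop in the diff above: the buddy of a block at a given order is the block whose pfn differs only in bit `order`, and the merged order+1 block starts at the lower pfn of the pair. A standalone sketch of that arithmetic (a simplified mirror of the kernel's __find_buddy_pfn(); not the full helper):

```c
#include <assert.h>

/* The buddy of the block at pfn, for a given order, is found by
 * flipping bit `order` of the pfn. */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* The merged block starts at the lower pfn of the pair, which is the
 * bitwise AND of the two (they differ only in bit `order`). */
static unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
	return pfn & find_buddy_pfn(pfn, order);
}
```

Once order reaches pageblock_order, pfn and its buddy sit in different pageblocks with independent bitmap entries, which is why the merge path above has to reconcile buddy_mt explicitly.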

* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-13  9:33       ` Vlastimil Babka
@ 2023-09-13 13:24         ` Johannes Weiner
  2023-09-13 13:34           ` Vlastimil Babka
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-13 13:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

Hello Vlastimil,

On Wed, Sep 13, 2023 at 11:33:52AM +0200, Vlastimil Babka wrote:
> On 9/12/23 16:50, Johannes Weiner wrote:
> > From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Tue, 12 Sep 2023 10:16:10 -0400
> > Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()
> > 
> > Move direct freeing of isolated pages to the lock-breaking block in
> > the second loop. This saves an unnecessary migratetype reassessment.
> > 
> > Minor comment and local variable scoping cleanups.
> 
> Looks like batch_count and locked_zone could be moved to the loop scope as well.

Hm they both maintain values over multiple iterations, so I don't
think that's possible. Am I missing something?

> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks! I'll send this out properly with your tag.


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-13 13:24         ` Johannes Weiner
@ 2023-09-13 13:34           ` Vlastimil Babka
  0 siblings, 0 replies; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13 13:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On 9/13/23 15:24, Johannes Weiner wrote:
> Hello Vlastimil,
> 
> On Wed, Sep 13, 2023 at 11:33:52AM +0200, Vlastimil Babka wrote:
>> On 9/12/23 16:50, Johannes Weiner wrote:
>> > From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
>> > From: Johannes Weiner <hannes@cmpxchg.org>
>> > Date: Tue, 12 Sep 2023 10:16:10 -0400
>> > Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()
>> > 
>> > Move direct freeing of isolated pages to the lock-breaking block in
>> > the second loop. This saves an unnecessary migratetype reassessment.
>> > 
>> > Minor comment and local variable scoping cleanups.
>> 
>> Looks like batch_count and locked_zone could be moved to the loop scope as well.
> 
> Hm they both maintain values over multiple iterations, so I don't
> think that's possible. Am I missing something?

True, disregard :D

>> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> 
>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Thanks! I'll send this out properly with your tag.

np!


* Re: [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation
  2023-09-11 19:41 ` [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation Johannes Weiner
  2023-09-11 20:17   ` Zi Yan
@ 2023-09-13 14:31   ` Vlastimil Babka
  2023-09-14 10:03   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13 14:31 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 9/11/23 21:41, Johannes Weiner wrote:
> When claiming a block during compaction isolation, move any remaining
> free pages to the correct freelists as well, instead of stranding them
> on the wrong list. Otherwise, this encourages incompatible page mixing
> down the line, and thus long-term fragmentation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3db405414174..f6f658c3d394 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2548,9 +2548,12 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  			 * Only change normal pageblocks (i.e., they can merge
>  			 * with others)
>  			 */
> -			if (migratetype_is_mergeable(mt))
> +			if (migratetype_is_mergeable(mt)) {
>  				set_pageblock_migratetype(page,
>  							  MIGRATE_MOVABLE);
> +				move_freepages_block(zone, page,
> +						     MIGRATE_MOVABLE, NULL);
> +			}
>  		}
>  	}
>  


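The invariant the patch above enforces — changing a block's type must be paired with moving its free pages — can be shown with a toy model (illustrative structures, not the kernel's):

```c
#include <assert.h>

enum mt { MOVABLE, UNMOVABLE, NR_TYPES };

struct toy_zone {
	int nr_free[NR_TYPES];	/* free pages per freelist */
	enum mt block_type;	/* pageblock bitmap entry */
};

/* The transaction from the patch, in miniature: update the block's
 * type AND migrate its remaining free pages, so the freelists stay
 * consistent with the bitmap. Skipping the move strands pages on the
 * old list, which is exactly the mixing the series fixes. */
static void convert_block(struct toy_zone *z, enum mt new_type)
{
	enum mt old = z->block_type;

	z->block_type = new_type;		/* set_pageblock_migratetype() */
	z->nr_free[new_type] += z->nr_free[old];	/* move_freepages_block() */
	z->nr_free[old] = 0;
}
```
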

* Re: [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error
  2023-09-11 19:41 ` [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error Johannes Weiner
  2023-09-11 20:23   ` Zi Yan
@ 2023-09-13 14:40   ` Vlastimil Babka
  2023-09-14 13:37     ` Johannes Weiner
  2023-09-14 10:03   ` Mel Gorman
  2 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13 14:40 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 9/11/23 21:41, Johannes Weiner wrote:
> When a block is partially outside the zone of the cursor page, the
> function cuts the range to the pivot page instead of the zone
> start. This can leave large parts of the block behind, which
> encourages incompatible page mixing down the line (ask for one type,
> get another), and thus long-term fragmentation.
> 
> This triggers reliably on the first block in the DMA zone, whose
> start_pfn is 1. The block is stolen, but everything before the pivot
> page (which was often hundreds of pages) is left on the old list.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Note below:

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f6f658c3d394..5bbe5f3be5ad 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1652,7 +1652,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
>  
>  	/* Do not cross zone boundaries */
>  	if (!zone_spans_pfn(zone, start_pfn))
> -		start_pfn = pfn;
> +		start_pfn = zone->zone_start_pfn;
>  	if (!zone_spans_pfn(zone, end_pfn))
>  		return 0;

Couldn't we also adjust end_pfn to zone_end_pfn() so we don't just ignore the
last half-pageblock for no good reason? (or am I missing any?)
Also would stop treating end_pfn as inclusive here and in move_freepages(),
it's rather uncommon.

>  


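The range fix in the patch above, together with the reviewer's half-open-range suggestion, can be sketched as follows (illustrative constants and names, not the kernel's helpers):

```c
#include <assert.h>

#define PAGEBLOCK_ORDER		9UL
#define PAGEBLOCK_NR_PAGES	(1UL << PAGEBLOCK_ORDER)

/* Compute the [start, end) pfn range of the pageblock containing pfn,
 * clamped to zone boundaries. The fix: clamp start to the zone start
 * instead of to pfn itself; per the review, end could likewise be
 * clamped instead of bailing out. Returns 0 if the range is empty. */
static int pageblock_range(unsigned long pfn,
			   unsigned long zone_start, unsigned long zone_end,
			   unsigned long *start, unsigned long *end)
{
	*start = pfn & ~(PAGEBLOCK_NR_PAGES - 1);	/* pageblock_start_pfn() */
	*end   = *start + PAGEBLOCK_NR_PAGES;		/* pageblock_end_pfn()  */

	if (*start < zone_start)
		*start = zone_start;
	if (*end > zone_end)
		*end = zone_end;
	return *start < *end;
}
```

With the old code, a pivot page in the middle of the DMA zone's first block would have clipped the range to the pivot pfn, leaving everything before it on the stale freelist.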

* Re: [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion
  2023-09-11 19:41 ` [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion Johannes Weiner
@ 2023-09-13 19:52   ` Vlastimil Babka
  2023-09-14 14:47     ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13 19:52 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 9/11/23 21:41, Johannes Weiner wrote:
> Currently, page block type conversion during fallbacks, atomic
> reservations and isolation can strand various amounts of free pages on
> incorrect freelists.
> 
> For example, fallback stealing moves free pages in the block to the
> new type's freelists, but then may not actually claim the block for
> that type if there aren't enough compatible pages already allocated.
> 
> In all cases, free page moving might fail if the block straddles more
> than one zone, in which case no free pages are moved at all, but the
> block type is changed anyway.
> 
> This is detrimental to type hygiene on the freelists. It encourages
> incompatible page mixing down the line (ask for one type, get another)
> and thus contributes to long-term fragmentation.
> 
> Split the process into a proper transaction: check first if conversion
> will happen, then try to move the free pages, and only if that was
> successful convert the block to the new type.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

<snip>

> @@ -1638,26 +1629,62 @@ static int move_freepages(struct zone *zone,
>  	return pages_moved;
>  }
>  
> -int move_freepages_block(struct zone *zone, struct page *page,
> -				int migratetype, int *num_movable)
> +static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> +				      unsigned long *start_pfn,
> +				      unsigned long *end_pfn,
> +				      int *num_free, int *num_movable)
>  {
> -	unsigned long start_pfn, end_pfn, pfn;
> -
> -	if (num_movable)
> -		*num_movable = 0;
> +	unsigned long pfn, start, end;
>  
>  	pfn = page_to_pfn(page);
> -	start_pfn = pageblock_start_pfn(pfn);
> -	end_pfn = pageblock_end_pfn(pfn) - 1;
> +	start = pageblock_start_pfn(pfn);
> +	end = pageblock_end_pfn(pfn) - 1;

>  	/* Do not cross zone boundaries */
> -	if (!zone_spans_pfn(zone, start_pfn))
> -		start_pfn = zone->zone_start_pfn;
> -	if (!zone_spans_pfn(zone, end_pfn))
> -		return 0;
> +	if (!zone_spans_pfn(zone, start))
> +		start = zone->zone_start_pfn;
> +	if (!zone_spans_pfn(zone, end))
> +		return false;

This brings me back to my previous suggestion - if we update the end, won't
the whole "block straddles >1 zones" situation we check for go away?

Hm or is it actually done because we have a problem by representing
pageblock migratetype with multiple zones, since there's a single
pageblock_bitmap entry per the respective pageblock range of pfn's, so one
zone's migratetype could mess with other's? And now it matters if we want
100% match of freelist vs pageblock migratetype?
(I think even before this series it could have mattered for
MIGRATE_ISOLATE, is it broken in those corner cases?)

But in that case we might not be detecting the situation properly for the
later of the two zones in a pageblock, because if start_pfn is not spanned
we adjust it and continue? Hmm...




* Re: [PATCH 6/6] mm: page_alloc: consolidate free page accounting
  2023-09-11 19:41 ` [PATCH 6/6] mm: page_alloc: consolidate free page accounting Johannes Weiner
@ 2023-09-13 20:18   ` Vlastimil Babka
  2023-09-14  4:11     ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-13 20:18 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 9/11/23 21:41, Johannes Weiner wrote:
> Free page accounting currently happens a bit too high up the call
> stack, where it has to deal with guard pages, compaction capturing,
> block stealing and even page isolation. This is subtle and fragile,
> and makes it difficult to hack on the code.
> 
> Now that type violations on the freelists have been fixed, push the
> accounting down to where pages enter and leave the freelist.
> 
> v3:
> - fix CONFIG_UNACCEPTED_MEMORY build (lkp)
> v2:
> - fix CONFIG_DEBUG_PAGEALLOC build (Mel)
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

<snip>

>  
>  /* Used for pages not on another list */
> -static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
> -					 unsigned int order, int migratetype)
> +static inline void add_to_free_list(struct page *page, struct zone *zone,
> +				    unsigned int order, int migratetype,
> +				    bool tail)
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> -	list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
> +	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
> +		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
> +		     get_pageblock_migratetype(page), migratetype, 1 << order);

Ok, IIUC so you now assume pageblock migratetype is now matching freelist
placement at all times. This is a change from the previous treatment as a
heuristic that may be sometimes imprecise. Let's assume the previous patches
handled the deterministic reasons why those would deviate (modulo my concern
about pageblocks spanning multiple zones in reply to 5/6).

But unless I'm missing something, I don't think the possible race scenarios
were dealt with? Pageblock migratetype is set under zone->lock but there are
places that read it outside of zone->lock and then trust it to perform the
freelist placement. See for example __free_pages_ok(), or free_unref_page()
in the cases it calls free_one_page(). These determine pageblock migratetype
before taking the zone->lock. Only for has_isolate_pageblock() cases we are
more careful, because previously isolation was the only case where precision
was needed. So I think this warning is going to trigger?

> +
> +	if (tail)
> +		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
> +	else
> +		list_add(&page->buddy_list, &area->free_list[migratetype]);
>  	area->nr_free++;
> +
> +	account_freepages(page, zone, 1 << order, migratetype);
>  }
>  
>  /*

<snip>

> @@ -757,23 +783,21 @@ static inline void __free_one_page(struct page *page,
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
>  
>  	VM_BUG_ON(migratetype == -1);
> -	if (likely(!is_migrate_isolate(migratetype)))
> -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> -
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
>  	while (order < MAX_ORDER) {
> -		if (compaction_capture(capc, page, order, migratetype)) {
> -			__mod_zone_freepage_state(zone, -(1 << order),
> -								migratetype);
> +		int buddy_mt;
> +
> +		if (compaction_capture(capc, page, order, migratetype))
>  			return;
> -		}
>  
>  		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
>  		if (!buddy)
>  			goto done_merging;
>  
> +		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);

You should assume buddy_mt equals migratetype, no? It's the same assumption
as the VM_WARN_ONCE() I've discussed?

> +
>  		if (unlikely(order >= pageblock_order)) {

Only here buddy_mt can differ and the code in this block already handles that.

>  			/*
>  			 * We want to prevent merge between freepages on pageblock
> @@ -801,9 +825,9 @@ static inline void __free_one_page(struct page *page,
>  		 * merge with it and move up one order.
>  		 */
>  		if (page_is_guard(buddy))
> -			clear_page_guard(zone, buddy, order, migratetype);
> +			clear_page_guard(zone, buddy, order);
>  		else
> -			del_page_from_free_list(buddy, zone, order);
> +			del_page_from_free_list(buddy, zone, order, buddy_mt);

Ugh so this will add account_freepages() call to each iteration of the
__free_one_page() hot loop, which seems like a lot of unnecessary overhead -
as long as we are within pageblock_order the migratetype should be the same,
and thus also is_migrate_isolate() and is_migrate_cma() tests should return
the same value so we shouldn't need to call __mod_zone_page_state()
piecemeal like this.

>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;



* Re: [PATCH 6/6] mm: page_alloc: consolidate free page accounting
  2023-09-13 20:18   ` Vlastimil Babka
@ 2023-09-14  4:11     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-14  4:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Wed, Sep 13, 2023 at 10:18:17PM +0200, Vlastimil Babka wrote:
> Pageblock migratetype is set under zone->lock but there are
> places that read it outside of zone->lock and then trust it to perform the
> freelist placement. See for example __free_pages_ok(), or free_unref_page()
> in the cases it calls free_one_page(). These determine pageblock migratetype
> before taking the zone->lock. Only for has_isolate_pageblock() cases we are
> more careful, because previously isolation was the only case where precision
> was needed. So I think this warning is going to trigger?

Good catch on these two. It didn't get the warning, but the code is
indeed not quite right. The fix looks straight-forward: move the
lookup in those two cases into the zone->lock.

__free_pages_ok() is used

- when freeing >COSTLY order pages that aren't THPs
- when bootstrapping the allocator
- when allocating with alloc_pages_exact()
- when "accepting" unaccepted vm host memory before first use

none of which are too hot to tolerate the bitmap access inside the
lock instead of outside. free_one_page() is used in the isolated pages
and pcp-contended paths, both of which are exceptions as well.

Plus, it saves two branches (under the lock) in those paths to test
for the isolate conditions.

And a double lookup in case there *is* isolation.

I checked the code again and didn't see any other instances like the
two here. But I'll double check tomorrow and then send a fix.

> > @@ -757,23 +783,21 @@ static inline void __free_one_page(struct page *page,
> >  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> >  
> >  	VM_BUG_ON(migratetype == -1);
> > -	if (likely(!is_migrate_isolate(migratetype)))
> > -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> > -
> >  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> >  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> >  
> >  	while (order < MAX_ORDER) {
> > -		if (compaction_capture(capc, page, order, migratetype)) {
> > -			__mod_zone_freepage_state(zone, -(1 << order),
> > -								migratetype);
> > +		int buddy_mt;
> > +
> > +		if (compaction_capture(capc, page, order, migratetype))
> >  			return;
> > -		}
> >  
> >  		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
> >  		if (!buddy)
> >  			goto done_merging;
> >  
> > +		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
> 
> You should assume buddy_mt equals migratetype, no? It's the same assumption
> as the VM_WARN_ONCE() I've discussed?

Ah, you're right, that lookup can be removed.

Actually, that section is brainfarts. There is an issue with updating
the buddy type before removing it from the list. I was confused why
this didn't warn for me, but it's because on my test setup, the
pageblock_order == MAX_ORDER since I don't have hugetlb compiled in.

The fix looks simple. I'll test it, with pb order 9 as well, and
follow-up.

> > @@ -801,9 +825,9 @@ static inline void __free_one_page(struct page *page,
> >  		 * merge with it and move up one order.
> >  		 */
> >  		if (page_is_guard(buddy))
> > -			clear_page_guard(zone, buddy, order, migratetype);
> > +			clear_page_guard(zone, buddy, order);
> >  		else
> > -			del_page_from_free_list(buddy, zone, order);
> > +			del_page_from_free_list(buddy, zone, order, buddy_mt);
> 
> Ugh so this will add account_freepages() call to each iteration of the
> __free_one_page() hot loop, which seems like a lot of unnecessary overhead -
> as long as we are within pageblock_order the migratetype should be the same,
> and thus also is_migrate_isolate() and is_migrate_cma() tests should return
> the same value so we shouldn't need to call __mod_zone_page_state()
> piecemeal like this.

Good point, this is unnecessarily naive. The net effect on the
counters from the buddy merging is nil. That's true even when we go
beyond the page block, because we don't merge isolated/cma blocks with
anything except their own.

I'll move the accounting out into a single call.

Thanks for your thorough review!


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-12 15:03     ` Johannes Weiner
@ 2023-09-14  7:29       ` Vlastimil Babka
  0 siblings, 0 replies; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-14  7:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On 9/12/23 17:03, Johannes Weiner wrote:
> On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
>> I think after this change we should [...]
> 
> Speaking of follow-ups, AFAICS we no longer need those either:

Seems so, but the comments do talk about races, so once those are sorted out :)

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cad31de1bf5..bea499fbca58 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1751,13 +1751,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>  
>  	old_block_type = get_pageblock_migratetype(page);
>  
> -	/*
> -	 * This can happen due to races and we want to prevent broken
> -	 * highatomic accounting.
> -	 */
> -	if (is_migrate_highatomic(old_block_type))
> -		goto single_page;
> -
>  	/* Take ownership for orders >= pageblock_order */
>  	if (current_order >= pageblock_order) {
>  		change_pageblock_range(page, current_order, start_type);
> @@ -1926,24 +1919,15 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  				continue;
>  
>  			/*
> -			 * In page freeing path, migratetype change is racy so
> -			 * we can counter several free pages in a pageblock
> -			 * in this loop although we changed the pageblock type
> -			 * from highatomic to ac->migratetype. So we should
> -			 * adjust the count once.
> +			 * It should never happen but changes to
> +			 * locking could inadvertently allow a per-cpu
> +			 * drain to add pages to MIGRATE_HIGHATOMIC
> +			 * while unreserving so be safe and watch for
> +			 * underflows.
>  			 */
> -			if (is_migrate_highatomic_page(page)) {
> -				/*
> -				 * It should never happen but changes to
> -				 * locking could inadvertently allow a per-cpu
> -				 * drain to add pages to MIGRATE_HIGHATOMIC
> -				 * while unreserving so be safe and watch for
> -				 * underflows.
> -				 */
> -				zone->nr_reserved_highatomic -= min(
> -						pageblock_nr_pages,
> -						zone->nr_reserved_highatomic);
> -			}
> +			zone->nr_reserved_highatomic -= min(
> +				pageblock_nr_pages,
> +				zone->nr_reserved_highatomic);
>  
>  			/*
>  			 * Convert to ac->migratetype and avoid the normal
> 
> I think they were only in place because we could change the highatomic
> status of pages on the pcplist, and those pages would then end up on
> some other freelist due to the stale pcppage cache.
> 
> I replaced them locally with WARNs and ran an hour or so of kernel
> builds under pressure. It didn't trigger. So I would send a follow up
> to remove them.
> 
> Unless you point me to a good reason why they're definitely still
> needed - in which case this is a moot proposal - but then we should
> make the comments more specific.



* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
  2023-09-11 19:59   ` Zi Yan
  2023-09-12 13:47   ` Vlastimil Babka
@ 2023-09-14  9:56   ` Mel Gorman
  2023-09-27  5:42   ` Huang, Ying
  3 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2023-09-14  9:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Mon, Sep 11, 2023 at 03:41:42PM -0400, Johannes Weiner wrote:
> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
> 
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> 
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> 
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> 
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> 
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> 
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), which is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I've no specific objection and other minor corrections have already been
suggested. I don't recall specifically but I think
get_pageblock_migratetype might have been called redundantly once upon a
time when there were concerns about page allocator overhead for
high-speed networking. Now that there is bulk allocation and the flow has
changed significantly, it's feasible to simply avoid calling
get_pageblock_migratetype unnecessarily.

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks
  2023-09-11 19:41 ` [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks Johannes Weiner
  2023-09-11 20:01   ` Zi Yan
  2023-09-13  9:52   ` Vlastimil Babka
@ 2023-09-14 10:00   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2023-09-14 10:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Mon, Sep 11, 2023 at 03:41:43PM -0400, Johannes Weiner wrote:
> The buddy allocator coalesces compatible blocks during freeing, but it
> doesn't update the types of the subblocks to match. When an allocation
> later breaks the chunk down again, its pieces will be put on freelists
> of the wrong type. This encourages incompatible page mixing (ask for
> one type, get another), and thus long-term fragmentation.
> 
> Update the subblocks when merging a larger chunk, such that a later
> expand() will maintain freelist type hygiene.
> 
> v2:
> - remove spurious change_pageblock_range() move (Zi Yan)
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I'm not 100% convinced on the amount of harm this causes but given that
it's a relatively rare condition, I didn't think about the consequences
too deeply. The patch certainly has merit so;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation
  2023-09-11 19:41 ` [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation Johannes Weiner
  2023-09-11 20:17   ` Zi Yan
  2023-09-13 14:31   ` Vlastimil Babka
@ 2023-09-14 10:03   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2023-09-14 10:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Mon, Sep 11, 2023 at 03:41:44PM -0400, Johannes Weiner wrote:
> When claiming a block during compaction isolation, move any remaining
> free pages to the correct freelists as well, instead of stranding them
> on the wrong list. Otherwise, this encourages incompatible page mixing
> down the line, and thus long-term fragmentation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Hmm, this is potentially expensive in some cases but it's also correct.
Given how expensive the whole path is, I doubt it's noticeable and some of
this activity will be !direct_compaction anyway and relatively invisible
even if I'm not a fan of hiding overhead in kthreads. Either way;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error
  2023-09-11 19:41 ` [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error Johannes Weiner
  2023-09-11 20:23   ` Zi Yan
  2023-09-13 14:40   ` Vlastimil Babka
@ 2023-09-14 10:03   ` Mel Gorman
  2 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2023-09-14 10:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Mon, Sep 11, 2023 at 03:41:45PM -0400, Johannes Weiner wrote:
> When a block is partially outside the zone of the cursor page, the
> function cuts the range to the pivot page instead of the zone
> start. This can leave large parts of the block behind, which
> encourages incompatible page mixing down the line (ask for one type,
> get another), and thus long-term fragmentation.
> 
> This triggers reliably on the first block in the DMA zone, whose
> start_pfn is 1. The block is stolen, but everything before the pivot
> page (which was often hundreds of pages) is left on the old list.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Oops

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error
  2023-09-13 14:40   ` Vlastimil Babka
@ 2023-09-14 13:37     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-14 13:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Wed, Sep 13, 2023 at 04:40:48PM +0200, Vlastimil Babka wrote:
> On 9/11/23 21:41, Johannes Weiner wrote:
> > When a block is partially outside the zone of the cursor page, the
> > function cuts the range to the pivot page instead of the zone
> > start. This can leave large parts of the block behind, which
> > encourages incompatible page mixing down the line (ask for one type,
> > get another), and thus long-term fragmentation.
> > 
> > This triggers reliably on the first block in the DMA zone, whose
> > start_pfn is 1. The block is stolen, but everything before the pivot
> > page (which was often hundreds of pages) is left on the old list.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > @@ -1652,7 +1652,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
> >  
> >  	/* Do not cross zone boundaries */
> >  	if (!zone_spans_pfn(zone, start_pfn))
> > -		start_pfn = pfn;
> > +		start_pfn = zone->zone_start_pfn;
> >  	if (!zone_spans_pfn(zone, end_pfn))
> >  		return 0;
> 
> Couldn't we also adjust end_pfn to zone_end_pfn() so we don't just ignore the
> last half-pageblock for no good reason? (or am I missing any?)
> Also would stop treating end_pfn as inclusive here and in move_freepages(),
> it's rather uncommon.

You raise a good point here and in the reply to 5/6. Let me reply to
the other email.


* Re: [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion
  2023-09-13 19:52   ` Vlastimil Babka
@ 2023-09-14 14:47     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-14 14:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On Wed, Sep 13, 2023 at 09:52:17PM +0200, Vlastimil Babka wrote:
> On 9/11/23 21:41, Johannes Weiner wrote:
> > @@ -1638,26 +1629,62 @@ static int move_freepages(struct zone *zone,
> >  	return pages_moved;
> >  }
> >  
> > -int move_freepages_block(struct zone *zone, struct page *page,
> > -				int migratetype, int *num_movable)
> > +static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> > +				      unsigned long *start_pfn,
> > +				      unsigned long *end_pfn,
> > +				      int *num_free, int *num_movable)
> >  {
> > -	unsigned long start_pfn, end_pfn, pfn;
> > -
> > -	if (num_movable)
> > -		*num_movable = 0;
> > +	unsigned long pfn, start, end;
> >  
> >  	pfn = page_to_pfn(page);
> > -	start_pfn = pageblock_start_pfn(pfn);
> > -	end_pfn = pageblock_end_pfn(pfn) - 1;
> > +	start = pageblock_start_pfn(pfn);
> > +	end = pageblock_end_pfn(pfn) - 1;
> 
> >  	/* Do not cross zone boundaries */
> > -	if (!zone_spans_pfn(zone, start_pfn))
> > -		start_pfn = zone->zone_start_pfn;
> > -	if (!zone_spans_pfn(zone, end_pfn))
> > -		return 0;
> > +	if (!zone_spans_pfn(zone, start))
> > +		start = zone->zone_start_pfn;
> > +	if (!zone_spans_pfn(zone, end))
> > +		return false;
> 
> This brings me back to my previous suggestion - if we update the end, won't
> the whole "block straddles >1 zones" situation we check for go away?
> 
> Hm or is it actually done because we have a problem by representing
> pageblock migratetype with multiple zones, since there's a single
> pageblock_bitmap entry per the respective pageblock range of pfn's, so one
> zone's migratetype could mess with other's? And now it matters if we want
> 100% match of freelist vs pageblock migratetype?

Yes, it's not safe to change a shared bitmap entry with only one of
the two zones locked.

So I think my range adjustment isn't a complete fix. It's okay for the
case I was directly encountering, where DMA starts with pfn 1 and pfn
0 belongs to nobody. But if the block straddles two genuine zones, a
race is possible.

> (I think even before this series it could have mattered for
> MIGRATETYPE_ISOLATE, is it broken in those corner cases?)

Yes, I think this is buggy indeed.

start_isolate_page_range() calls isolate_single_pageblock() on block
boundaries. It actually does round up to the zone start if the pfn is
below it, since b2c9e2fbba32 ("mm: make alloc_contig_range work at
pageblock granularity") from Zi last year. But it will still set the
migratetype on a straddling block.

And I don't see any handling for the end of the block being in another
zone. It won't move free pages due to the above, but it appears to set
the isolate migratetype in an unlocked zone.

Since nobody has complained about this, I wonder if blocks truly
straddling two different zones isn't just rare but actually
non-existent. The DMA and DMA32 boundaries should naturally align to
multiples of the pageblock order, but there might be exceptions with
ZONE_MOVABLE. Maybe somebody remembers situations where this occurs?

> But in that case we might not be detecting the situation properly for the
> later of the two zones in a pageblock, because if start_pfn is not spanned
> we adjust it and continue? Hmm...

I think what needs to happen is return false in both cases and reject
operation on blocks whose pages are in two different zones. None of
the callers expect it, and don't hold both zone locks that would be
necessary to safely move pages and adjust the migratetype.

This would fix the isolate race, as well as the freelist race that
this series is trying to eliminate.

It would mean that a straddling block can still be stolen from during
fallback, but cannot be claimed entirely and will stay MOVABLE.

It's not perfect, but certainly sounds a lot more reasonable than a
double zone locking scheme for all callers.


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
                   ` (5 preceding siblings ...)
  2023-09-11 19:41 ` [PATCH 6/6] mm: page_alloc: consolidate free page accounting Johannes Weiner
@ 2023-09-14 23:52 ` Mike Kravetz
  2023-09-15 14:16   ` Johannes Weiner
  6 siblings, 1 reply; 83+ messages in thread
From: Mike Kravetz @ 2023-09-14 23:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

In next-20230913, I started hitting the following BUG.  Seems related
to this series.  And, if series is reverted I do not see the BUG.

I can easily reproduce on a small 16G VM.  kernel command line contains
"hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
while true; do
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

For the BUG below I believe it was the first (or second) 1G page creation from
CMA that triggered:  cma_alloc of 1G.

Sorry, have not looked deeper into the issue.

[   28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
[   28.645455] flags: 0x200000000000000(node=0|zone=2)
[   28.646835] page_type: 0xffffffff()
[   28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[   28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
[   28.654769] ------------[ cut here ]------------
[   28.655972] kernel BUG at mm/page_alloc.c:1231!
[   28.657139] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[   28.658354] CPU: 2 PID: 885 Comm: bash Not tainted 6.6.0-rc1-next-20230913+ #3
[   28.660090] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[   28.662054] RIP: 0010:free_pcppages_bulk+0x192/0x240
[   28.663284] Code: 22 48 89 45 08 8b 44 24 0c 41 29 44 24 04 41 29 c6 41 83 f8 05 0f 85 4c ff ff ff 48 c7 c6 20 a5 22 82 48 89 df e8 4e cf fc ff <0f> 0b 65 8b 05 41 8b d3 7e 89 c0 48 0f a3 05 fb 35 39 01 0f 83 40
[   28.667422] RSP: 0018:ffffc90003b9faf0 EFLAGS: 00010046
[   28.668643] RAX: 000000000000003b RBX: ffffea0004fb4280 RCX: 0000000000000000
[   28.670245] RDX: 0000000000000000 RSI: ffffffff8224dace RDI: 00000000ffffffff
[   28.671920] RBP: ffffea0004fb4288 R08: 0000000000009ffb R09: 00000000ffffdfff
[   28.673614] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff888477c30540
[   28.675213] R13: ffff888477c30550 R14: 00000000000012f5 R15: 000000000013ed0a
[   28.676832] FS:  00007f60039b9740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[   28.678709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.680046] CR2: 00005615f9bf3048 CR3: 00000003128b6005 CR4: 0000000000370ee0
[   28.682897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.684501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   28.686098] Call Trace:
[   28.686792]  <TASK>
[   28.687414]  ? die+0x32/0x80
[   28.688197]  ? do_trap+0xd6/0x100
[   28.689069]  ? free_pcppages_bulk+0x192/0x240
[   28.690135]  ? do_error_trap+0x6a/0x90
[   28.691082]  ? free_pcppages_bulk+0x192/0x240
[   28.692187]  ? exc_invalid_op+0x49/0x60
[   28.693154]  ? free_pcppages_bulk+0x192/0x240
[   28.694225]  ? asm_exc_invalid_op+0x16/0x20
[   28.695291]  ? free_pcppages_bulk+0x192/0x240
[   28.696405]  drain_pages_zone+0x3f/0x50
[   28.697404]  __drain_all_pages+0xe2/0x1e0
[   28.698472]  alloc_contig_range+0x143/0x280
[   28.699581]  ? bitmap_find_next_zero_area_off+0x3d/0x90
[   28.700902]  cma_alloc+0x156/0x470
[   28.701852]  ? kernfs_fop_write_iter+0x160/0x1f0
[   28.703053]  alloc_fresh_hugetlb_folio+0x7e/0x270
[   28.704272]  alloc_pool_huge_page+0x7d/0x100
[   28.705448]  set_max_huge_pages+0x162/0x390
[   28.706530]  nr_hugepages_store_common+0x91/0xf0
[   28.707689]  kernfs_fop_write_iter+0x108/0x1f0
[   28.708819]  vfs_write+0x207/0x400
[   28.709743]  ksys_write+0x63/0xe0
[   28.710640]  do_syscall_64+0x37/0x90
[   28.712649]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   28.713919] RIP: 0033:0x7f6003aade87
[   28.714879] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   28.719096] RSP: 002b:00007ffdfd9d2e98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   28.720945] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f6003aade87
[   28.722626] RDX: 0000000000000002 RSI: 00005615f9bac620 RDI: 0000000000000001
[   28.724288] RBP: 00005615f9bac620 R08: 000000000000000a R09: 00007f6003b450c0
[   28.725939] R10: 00007f6003b44fc0 R11: 0000000000000246 R12: 0000000000000002
[   28.727611] R13: 00007f6003b81520 R14: 0000000000000002 R15: 00007f6003b81720
[   28.729285]  </TASK>
[   28.729944] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs joydev snd_pcm snd_timer 9pnet_virtio snd soundcore virtio_balloon 9pnet virtio_console virtio_net virtio_blk net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel virtio_pci ghash_clmulni_intel serio_raw virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[   28.739325] ---[ end trace 0000000000000000 ]---

-- 
Mike Kravetz


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-14 23:52 ` [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Mike Kravetz
@ 2023-09-15 14:16   ` Johannes Weiner
  2023-09-15 15:05     ` Mike Kravetz
                       ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-15 14:16 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> In next-20230913, I started hitting the following BUG.  Seems related
> to this series.  And, if series is reverted I do not see the BUG.
> 
> I can easily reproduce on a small 16G VM.  kernel command line contains
> "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
> 
> For the BUG below I believe it was the first (or second) 1G page creation from
> CMA that triggered:  cma_alloc of 1G.
> 
> Sorry, have not looked deeper into the issue.

Thanks for the report, and sorry about the breakage!

I was scratching my head at this:

                        /* MIGRATE_ISOLATE page should not go to pcplists */
                        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);

because there is nothing in page isolation that prevents setting
MIGRATE_ISOLATE on something that's on the pcplist already. So why
didn't this trigger before already?

Then it clicked: it used to only check the *pcpmigratetype* determined
by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.

Pages that get isolated while *already* on the pcplist are fine, and
are handled properly:

                        mt = get_pcppage_migratetype(page);

                        /* MIGRATE_ISOLATE page should not go to pcplists */
                        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);

                        /* Pageblock could have been isolated meanwhile */
                        if (unlikely(isolated_pageblocks))
                                mt = get_pageblock_migratetype(page);

So this was purely a sanity check against the pcpmigratetype cache
operations. With that gone, we can remove it.

---

From b0cb92ed10b40fab0921002effa8b726df245790 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Sep 2023 09:59:52 -0400
Subject: [PATCH] mm: page_alloc: remove pcppage migratetype caching fix

Mike reports the following crash in -next:

[   28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
[   28.645455] flags: 0x200000000000000(node=0|zone=2)
[   28.646835] page_type: 0xffffffff()
[   28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[   28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
[   28.654769] ------------[ cut here ]------------
[   28.655972] kernel BUG at mm/page_alloc.c:1231!

This VM_BUG_ON() used to check that the cached pcppage_migratetype set
by free_unref_page() wasn't MIGRATE_ISOLATE.

When I removed the caching, I erroneously changed the assert to check
that no isolated pages are on the pcplist. This is quite different,
because pages can be isolated *after* they had been put on the
freelist already (which is handled just fine).

IOW, this was purely a sanity check on the migratetype caching. With
that gone, the check should have been removed as well. Do that now.

Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..9469e4660b53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1207,9 +1207,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-
 			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
 			trace_mm_page_pcpu_drain(page, order, mt);
 		} while (count > 0 && !list_empty(list));
-- 
2.42.0



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-15 14:16   ` Johannes Weiner
@ 2023-09-15 15:05     ` Mike Kravetz
  2023-09-16 19:57     ` Mike Kravetz
  2023-09-18  7:07     ` Vlastimil Babka
  2 siblings, 0 replies; 83+ messages in thread
From: Mike Kravetz @ 2023-09-15 15:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 09/15/23 10:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > In next-20230913, I started hitting the following BUG.  Seems related
> > to this series.  And, if series is reverted I do not see the BUG.
> > 
> > I can easily reproduce on a small 16G VM.  kernel command line contains
> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> > 
> > For the BUG below I believe it was the first (or second) 1G page creation from
> > CMA that triggered:  cma_alloc of 1G.
> > 
> > Sorry, have not looked deeper into the issue.
> 
> Thanks for the report, and sorry about the breakage!
> 
> I was scratching my head at this:
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
> 
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> 
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
> 
>                         mt = get_pcppage_migratetype(page);
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
>                         /* Pageblock could have been isolated meanwhile */
>                         if (unlikely(isolated_pageblocks))
>                                 mt = get_pageblock_migratetype(page);
> 
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

Thanks!  That makes sense.

Glad my testing (for something else) triggered it.
-- 
Mike Kravetz


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-15 14:16   ` Johannes Weiner
  2023-09-15 15:05     ` Mike Kravetz
@ 2023-09-16 19:57     ` Mike Kravetz
  2023-09-16 20:13       ` Andrew Morton
  2023-09-18  7:16       ` Vlastimil Babka
  2023-09-18  7:07     ` Vlastimil Babka
  2 siblings, 2 replies; 83+ messages in thread
From: Mike Kravetz @ 2023-09-16 19:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 09/15/23 10:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > In next-20230913, I started hitting the following BUG.  Seems related
> > to this series.  And, if series is reverted I do not see the BUG.
> > 
> > I can easily reproduce on a small 16G VM.  kernel command line contains
> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> > 
> > For the BUG below I believe it was the first (or second) 1G page creation from
> > CMA that triggered:  cma_alloc of 1G.
> > 
> > Sorry, have not looked deeper into the issue.
> 
> Thanks for the report, and sorry about the breakage!
> 
> I was scratching my head at this:
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
> 
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> 
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
> 
>                         mt = get_pcppage_migratetype(page);
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
>                         /* Pageblock could have been isolated meanwhile */
>                         if (unlikely(isolated_pageblocks))
>                                 mt = get_pageblock_migratetype(page);
> 
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

With the patch below applied, a slightly different workload triggers the
following warnings.  It seems related, and appears to go away when
reverting the series.

[  331.595382] ------------[ cut here ]------------
[  331.596665] page type is 5, passed migratetype is 1 (nr=512)
[  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
[  331.600549] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  331.609530] CPU: 2 PID: 935 Comm: bash Tainted: G        W          6.6.0-rc1-next-20230913+ #26
[  331.611603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  331.613527] RIP: 0010:expand+0x1c9/0x200
[  331.614492] Code: 89 ef be 07 00 00 00 c6 05 c9 b1 35 01 01 e8 de f7 ff ff 8b 4c 24 30 8b 54 24 0c 48 c7 c7 68 9f 22 82 48 89 c6 e8 97 b3 df ff <0f> 0b e9 db fe ff ff 48 c7 c6 f8 9f 22 82 48 89 df e8 41 e3 fc ff
[  331.618540] RSP: 0018:ffffc90003c97a88 EFLAGS: 00010086
[  331.619801] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[  331.621331] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
[  331.622914] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  331.624712] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff88827fffcd80
[  331.626317] R13: 0000000000000009 R14: 0000000000000200 R15: 000000000000000a
[  331.627810] FS:  00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[  331.630593] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  331.631865] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
[  331.633382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  331.634873] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  331.636324] Call Trace:
[  331.636934]  <TASK>
[  331.637521]  ? expand+0x1c9/0x200
[  331.638320]  ? __warn+0x7d/0x130
[  331.639116]  ? expand+0x1c9/0x200
[  331.639957]  ? report_bug+0x18d/0x1c0
[  331.640832]  ? handle_bug+0x41/0x70
[  331.641635]  ? exc_invalid_op+0x13/0x60
[  331.642522]  ? asm_exc_invalid_op+0x16/0x20
[  331.643494]  ? expand+0x1c9/0x200
[  331.644264]  ? expand+0x1c9/0x200
[  331.645007]  rmqueue_bulk+0xf4/0x530
[  331.645847]  get_page_from_freelist+0x3ed/0x1040
[  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[  331.647977]  __alloc_pages+0xec/0x240
[  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
[  331.650938]  alloc_pool_huge_folio+0xad/0x110
[  331.651909]  set_max_huge_pages+0x17d/0x390
[  331.652760]  nr_hugepages_store_common+0x91/0xf0
[  331.653825]  kernfs_fop_write_iter+0x108/0x1f0
[  331.654986]  vfs_write+0x207/0x400
[  331.655925]  ksys_write+0x63/0xe0
[  331.656832]  do_syscall_64+0x37/0x90
[  331.657793]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  331.660398] RIP: 0033:0x7f24b3a26e87
[  331.661342] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  331.665673] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  331.667541] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
[  331.669197] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
[  331.670883] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
[  331.672536] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
[  331.674175] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
[  331.675841]  </TASK>
[  331.676450] ---[ end trace 0000000000000000 ]---
[  331.677659] ------------[ cut here ]------------
[  331.679109] page type is 5, passed migratetype is 1 (nr=512)
[  331.680376] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[  331.682314] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  331.691852] CPU: 2 PID: 935 Comm: bash Tainted: G        W          6.6.0-rc1-next-20230913+ #26
[  331.694026] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  331.696162] RIP: 0010:del_page_from_free_list+0x137/0x170
[  331.697589] Code: c6 05 a0 b5 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 68 9f 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 69 b7 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 a0 9f 22 82 48 89 df e8 13 e7 fc ff
[  331.702060] RSP: 0018:ffffc90003c97ac8 EFLAGS: 00010086
[  331.703430] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[  331.705284] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
[  331.707101] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  331.708933] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: 0000000000000001
[  331.710754] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 0000000000000009
[  331.712637] FS:  00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[  331.714861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  331.716466] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
[  331.718441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  331.720372] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  331.723583] Call Trace:
[  331.724351]  <TASK>
[  331.725045]  ? del_page_from_free_list+0x137/0x170
[  331.726370]  ? __warn+0x7d/0x130
[  331.727326]  ? del_page_from_free_list+0x137/0x170
[  331.728637]  ? report_bug+0x18d/0x1c0
[  331.729688]  ? handle_bug+0x41/0x70
[  331.730707]  ? exc_invalid_op+0x13/0x60
[  331.731798]  ? asm_exc_invalid_op+0x16/0x20
[  331.733007]  ? del_page_from_free_list+0x137/0x170
[  331.734317]  ? del_page_from_free_list+0x137/0x170
[  331.735649]  rmqueue_bulk+0xdf/0x530
[  331.736741]  get_page_from_freelist+0x3ed/0x1040
[  331.738069]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[  331.739578]  __alloc_pages+0xec/0x240
[  331.740666]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[  331.742135]  __alloc_fresh_hugetlb_folio+0x157/0x230
[  331.743521]  alloc_pool_huge_folio+0xad/0x110
[  331.744768]  set_max_huge_pages+0x17d/0x390
[  331.745988]  nr_hugepages_store_common+0x91/0xf0
[  331.747306]  kernfs_fop_write_iter+0x108/0x1f0
[  331.748651]  vfs_write+0x207/0x400
[  331.749735]  ksys_write+0x63/0xe0
[  331.750808]  do_syscall_64+0x37/0x90
[  331.753203]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  331.754857] RIP: 0033:0x7f24b3a26e87
[  331.756184] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  331.760239] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  331.761935] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
[  331.763524] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
[  331.765102] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
[  331.766740] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
[  331.768344] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
[  331.769949]  </TASK>
[  331.770559] ---[ end trace 0000000000000000 ]---

-- 
Mike Kravetz

> ---
> 
> From b0cb92ed10b40fab0921002effa8b726df245790 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 15 Sep 2023 09:59:52 -0400
> Subject: [PATCH] mm: page_alloc: remove pcppage migratetype caching fix
> 
> Mike reports the following crash in -next:
> 
> [   28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
> [   28.645455] flags: 0x200000000000000(node=0|zone=2)
> [   28.646835] page_type: 0xffffffff()
> [   28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
> [   28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> [   28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
> [   28.654769] ------------[ cut here ]------------
> [   28.655972] kernel BUG at mm/page_alloc.c:1231!
> 
> This VM_BUG_ON() used to check that the cached pcppage_migratetype set
> by free_unref_page() wasn't MIGRATE_ISOLATE.
> 
> When I removed the caching, I erroneously changed the assert to check
> that no isolated pages are on the pcplist. This is quite different,
> because pages can be isolated *after* they had been put on the
> freelist already (which is handled just fine).
> 
> IOW, this was purely a sanity check on the migratetype caching. With
> that gone, the check should have been removed as well. Do that now.
> 
> Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e3f1c777feed..9469e4660b53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1207,9 +1207,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			count -= nr_pages;
>  			pcp->count -= nr_pages;
>  
> -			/* MIGRATE_ISOLATE page should not go to pcplists */
> -			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> -
>  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
>  			trace_mm_page_pcpu_drain(page, order, mt);
>  		} while (count > 0 && !list_empty(list));
> -- 
> 2.42.0
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-16 19:57     ` Mike Kravetz
@ 2023-09-16 20:13       ` Andrew Morton
  2023-09-18  7:16       ` Vlastimil Babka
  1 sibling, 0 replies; 83+ messages in thread
From: Andrew Morton @ 2023-09-16 20:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Johannes Weiner, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On Sat, 16 Sep 2023 12:57:39 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > So this was purely a sanity check against the pcpmigratetype cache
> > operations. With that gone, we can remove it.
> 
> With the patch below applied, a slightly different workload triggers the
> following warnings.  It seems related, and appears to go away when
> reverting the series.

Thanks, I've dropped this v2 series from mm.git.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-15 14:16   ` Johannes Weiner
  2023-09-15 15:05     ` Mike Kravetz
  2023-09-16 19:57     ` Mike Kravetz
@ 2023-09-18  7:07     ` Vlastimil Babka
  2023-09-18 14:09       ` Johannes Weiner
  2 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-18  7:07 UTC (permalink / raw)
  To: Johannes Weiner, Mike Kravetz
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On 9/15/23 16:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>> In next-20230913, I started hitting the following BUG.  Seems related
>> to this series.  And, if series is reverted I do not see the BUG.
>> 
>> I can easily reproduce on a small 16G VM.  kernel command line contains
>> "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
>> while true; do
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> done
>> 
>> For the BUG below I believe it was the first (or second) 1G page creation from
>> CMA that triggered:  cma_alloc of 1G.
>> 
>> Sorry, have not looked deeper into the issue.
> 
> Thanks for the report, and sorry about the breakage!
> 
> I was scratching my head at this:
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
> 
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> 
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
> 
>                         mt = get_pcppage_migratetype(page);
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
>                         /* Pageblock could have been isolated meanwhile */
>                         if (unlikely(isolated_pageblocks))
>                                 mt = get_pageblock_migratetype(page);
> 
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

Agreed, I assume you'll fold it in 1/6 in v3.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-16 19:57     ` Mike Kravetz
  2023-09-16 20:13       ` Andrew Morton
@ 2023-09-18  7:16       ` Vlastimil Babka
  2023-09-18 14:52         ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-18  7:16 UTC (permalink / raw)
  To: Mike Kravetz, Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm, linux-kernel

On 9/16/23 21:57, Mike Kravetz wrote:
> On 09/15/23 10:16, Johannes Weiner wrote:
>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>> > In next-20230913, I started hitting the following BUG.  Seems related
>> > to this series.  And, if series is reverted I do not see the BUG.
>> > 
>> > I can easily reproduce on a small 16G VM.  kernel command line contains
>> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
>> > while true; do
>> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> > done
>> > 
>> > For the BUG below I believe it was the first (or second) 1G page creation from
>> > CMA that triggered:  cma_alloc of 1G.
>> > 
>> > Sorry, have not looked deeper into the issue.
>> 
>> Thanks for the report, and sorry about the breakage!
>> 
>> I was scratching my head at this:
>> 
>>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>> 
>> because there is nothing in page isolation that prevents setting
>> MIGRATE_ISOLATE on something that's on the pcplist already. So why
>> didn't this trigger before already?
>> 
>> Then it clicked: it used to only check the *pcpmigratetype* determined
>> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
>> 
>> Pages that get isolated while *already* on the pcplist are fine, and
>> are handled properly:
>> 
>>                         mt = get_pcppage_migratetype(page);
>> 
>>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>> 
>>                         /* Pageblock could have been isolated meanwhile */
>>                         if (unlikely(isolated_pageblocks))
>>                                 mt = get_pageblock_migratetype(page);
>> 
>> So this was purely a sanity check against the pcpmigratetype cache
>> operations. With that gone, we can remove it.
> 
> With the patch below applied, a slightly different workload triggers the
> following warnings.  It seems related, and appears to go away when
> reverting the series.
> 
> [  331.595382] ------------[ cut here ]------------
> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200

Initially I thought this demonstrates the possible race I was suggesting in
reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
are trying to get a MOVABLE page from a CMA page block, which is something
that's normally done and the pageblock stays CMA. So yeah if the warnings
are to stay, they need to handle this case. Maybe the same can happen with
HIGHATOMIC blocks?

> [  331.600549] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> [  331.609530] CPU: 2 PID: 935 Comm: bash Tainted: G        W          6.6.0-rc1-next-20230913+ #26
> [  331.611603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [  331.613527] RIP: 0010:expand+0x1c9/0x200
> [  331.614492] Code: 89 ef be 07 00 00 00 c6 05 c9 b1 35 01 01 e8 de f7 ff ff 8b 4c 24 30 8b 54 24 0c 48 c7 c7 68 9f 22 82 48 89 c6 e8 97 b3 df ff <0f> 0b e9 db fe ff ff 48 c7 c6 f8 9f 22 82 48 89 df e8 41 e3 fc ff
> [  331.618540] RSP: 0018:ffffc90003c97a88 EFLAGS: 00010086
> [  331.619801] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> [  331.621331] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
> [  331.622914] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> [  331.624712] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff88827fffcd80
> [  331.626317] R13: 0000000000000009 R14: 0000000000000200 R15: 000000000000000a
> [  331.627810] FS:  00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
> [  331.630593] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  331.631865] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
> [  331.633382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  331.634873] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  331.636324] Call Trace:
> [  331.636934]  <TASK>
> [  331.637521]  ? expand+0x1c9/0x200
> [  331.638320]  ? __warn+0x7d/0x130
> [  331.639116]  ? expand+0x1c9/0x200
> [  331.639957]  ? report_bug+0x18d/0x1c0
> [  331.640832]  ? handle_bug+0x41/0x70
> [  331.641635]  ? exc_invalid_op+0x13/0x60
> [  331.642522]  ? asm_exc_invalid_op+0x16/0x20
> [  331.643494]  ? expand+0x1c9/0x200
> [  331.644264]  ? expand+0x1c9/0x200
> [  331.645007]  rmqueue_bulk+0xf4/0x530
> [  331.645847]  get_page_from_freelist+0x3ed/0x1040
> [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
> [  331.647977]  __alloc_pages+0xec/0x240
> [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
> [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
> [  331.650938]  alloc_pool_huge_folio+0xad/0x110
> [  331.651909]  set_max_huge_pages+0x17d/0x390
> [  331.652760]  nr_hugepages_store_common+0x91/0xf0
> [  331.653825]  kernfs_fop_write_iter+0x108/0x1f0
> [  331.654986]  vfs_write+0x207/0x400
> [  331.655925]  ksys_write+0x63/0xe0
> [  331.656832]  do_syscall_64+0x37/0x90
> [  331.657793]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  331.660398] RIP: 0033:0x7f24b3a26e87
> [  331.661342] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [  331.665673] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  331.667541] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
> [  331.669197] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
> [  331.670883] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
> [  331.672536] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
> [  331.674175] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
> [  331.675841]  </TASK>
> [  331.676450] ---[ end trace 0000000000000000 ]---
> [  331.677659] ------------[ cut here ]------------
> 
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-18  7:07     ` Vlastimil Babka
@ 2023-09-18 14:09       ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-18 14:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Kravetz, Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang,
	Zi Yan, linux-mm, linux-kernel

On Mon, Sep 18, 2023 at 09:07:53AM +0200, Vlastimil Babka wrote:
> On 9/15/23 16:16, Johannes Weiner wrote:
> > On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> In next-20230913, I started hitting the following BUG.  Seems related
> >> to this series.  And, if series is reverted I do not see the BUG.
> >> 
> >> I can easily reproduce on a small 16G VM.  kernel command line contains
> >> "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> >> while true; do
> >>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> done
> >> 
> >> For the BUG below I believe it was the first (or second) 1G page creation from
> >> CMA that triggered:  cma_alloc of 1G.
> >> 
> >> Sorry, have not looked deeper into the issue.
> > 
> > Thanks for the report, and sorry about the breakage!
> > 
> > I was scratching my head at this:
> > 
> >                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > 
> > because there is nothing in page isolation that prevents setting
> > MIGRATE_ISOLATE on something that's on the pcplist already. So why
> > didn't this trigger before already?
> > 
> > Then it clicked: it used to only check the *pcpmigratetype* determined
> > by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> > 
> > Pages that get isolated while *already* on the pcplist are fine, and
> > are handled properly:
> > 
> >                         mt = get_pcppage_migratetype(page);
> > 
> >                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > 
> >                         /* Pageblock could have been isolated meanwhile */
> >                         if (unlikely(isolated_pageblocks))
> >                                 mt = get_pageblock_migratetype(page);
> > 
> > So this was purely a sanity check against the pcpmigratetype cache
> > operations. With that gone, we can remove it.
> 
> Agreed, I assume you'll fold it in 1/6 in v3.

Yes, will do.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-18  7:16       ` Vlastimil Babka
@ 2023-09-18 14:52         ` Johannes Weiner
  2023-09-18 17:40           ` Mike Kravetz
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-18 14:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Kravetz, Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang,
	Zi Yan, linux-mm, linux-kernel

On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> On 9/16/23 21:57, Mike Kravetz wrote:
> > On 09/15/23 10:16, Johannes Weiner wrote:
> >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> > In next-20230913, I started hitting the following BUG.  Seems related
> >> > to this series.  And, if series is reverted I do not see the BUG.
> >> > 
> >> > I can easily reproduce on a small 16G VM.  kernel command line contains
> >> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> >> > while true; do
> >> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> > done
> >> > 
> >> > For the BUG below I believe it was the first (or second) 1G page creation from
> >> > CMA that triggered:  cma_alloc of 1G.
> >> > 
> >> > Sorry, have not looked deeper into the issue.
> >> 
> >> Thanks for the report, and sorry about the breakage!
> >> 
> >> I was scratching my head at this:
> >> 
> >>                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >> 
> >> because there is nothing in page isolation that prevents setting
> >> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> >> didn't this trigger before already?
> >> 
> >> Then it clicked: it used to only check the *pcpmigratetype* determined
> >> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> >> 
> >> Pages that get isolated while *already* on the pcplist are fine, and
> >> are handled properly:
> >> 
> >>                         mt = get_pcppage_migratetype(page);
> >> 
> >>                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >> 
> >>                         /* Pageblock could have been isolated meanwhile */
> >>                         if (unlikely(isolated_pageblocks))
> >>                                 mt = get_pageblock_migratetype(page);
> >> 
> >> So this was purely a sanity check against the pcpmigratetype cache
> >> operations. With that gone, we can remove it.
> > 
> > With the patch below applied, a slightly different workload triggers the
> > following warnings.  It seems related, and appears to go away when
> > reverting the series.
> > 
> > [  331.595382] ------------[ cut here ]------------
> > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> 
> Initially I thought this demonstrates the possible race I was suggesting in
> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> are trying to get a MOVABLE page from a CMA page block, which is something
> that's normally done and the pageblock stays CMA. So yeah if the warnings
> are to stay, they need to handle this case. Maybe the same can happen with
> HIGHATOMIC blocks?

Hm I don't think that's quite it.

CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
into CMA and HIGHATOMIC, we explicitly pass that migratetype to
__rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
remainder to the CMA freelist, then returns the page. While you get a
different mt than requested, the freelist typing should be consistent.

In this splat, the migratetype passed to __rmqueue_smallest() is
MOVABLE. There is no preceding warning from del_page_from_freelist()
(Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
order-10 block from the MOVABLE list. So far so good. However, when we
expand() the order-9 tail of this block to the MOVABLE list, it warns
that its pageblock type is CMA.

This means we have an order-10 page where one half is MOVABLE and the
other is CMA.

I don't see how the merging code in __free_one_page() could have done
that. The CMA buddy would have failed the migrate_is_mergeable() test
and we should have left it at order-9s.

I also don't see how the CMA setup could have done this because
MIGRATE_CMA is set on the range before the pages are fed to the buddy.

Mike, could you describe the workload that is triggering this?

Does this reproduce instantly and reliably?

Is there high load on the system, or is it requesting the huge page
with not much else going on?

Do you see compact_* history in /proc/vmstat after this triggers?

Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
the hugetlb_cma= parameter you're using?

Thanks!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-18 14:52         ` Johannes Weiner
@ 2023-09-18 17:40           ` Mike Kravetz
  2023-09-19  6:49             ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Mike Kravetz @ 2023-09-18 17:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Andrew Morton, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On 09/18/23 10:52, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > On 9/16/23 21:57, Mike Kravetz wrote:
> > > On 09/15/23 10:16, Johannes Weiner wrote:
> > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > 
> > > With the patch below applied, a slightly different workload triggers the
> > > following warnings.  It seems related, and appears to go away when
> > > reverting the series.
> > > 
> > > [  331.595382] ------------[ cut here ]------------
> > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > 
> > Initially I thought this demonstrates the possible race I was suggesting in
> > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > are trying to get a MOVABLE page from a CMA page block, which is something
> > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > are to stay, they need to handle this case. Maybe the same can happen with
> > HIGHATOMIC blocks?
> 
> Hm I don't think that's quite it.
> 
> CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
> into CMA and HIGHATOMIC, we explicitly pass that migratetype to
> __rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
> remainder to the CMA freelist, then returns the page. While you get a
> different mt than requested, the freelist typing should be consistent.
> 
> In this splat, the migratetype passed to __rmqueue_smallest() is
> MOVABLE. There is no preceding warning from del_page_from_freelist()
> (Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
> order-10 block from the MOVABLE list. So far so good. However, when we
> expand() the order-9 tail of this block to the MOVABLE list, it warns
> that its pageblock type is CMA.
> 
> This means we have an order-10 page where one half is MOVABLE and the
> other is CMA.
> 
> I don't see how the merging code in __free_one_page() could have done
> that. The CMA buddy would have failed the migrate_is_mergeable() test
> and we should have left it at order-9s.
> 
> I also don't see how the CMA setup could have done this because
> MIGRATE_CMA is set on the range before the pages are fed to the buddy.
> 
> Mike, could you describe the workload that is triggering this?

This 'slightly different workload' is actually a slightly different
environment.  Sorry for mis-speaking!  The slight difference is that this
environment does not use the 'alloc hugetlb gigantic pages from CMA'
(hugetlb_cma) feature that triggered the previous issue.

This is still on a 16G VM.  Kernel command line here is:
"BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
hugetlb_free_vmemmap=on"

The workload is just running this script:
while true; do
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

> 
> Does this reproduce instantly and reliably?
> 

It is not 'instant' but will reproduce fairly reliably within a minute
or so.

Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
will eventually be freed via __free_pages(folio, 9).

> Is there high load on the system, or is it requesting the huge page
> with not much else going on?

Only the script was running.

> Do you see compact_* history in /proc/vmstat after this triggers?

As one might expect, compact_isolated continually increases during this
this run.

> Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
> the hugetlb_cma= parameter you're using?

As mentioned above, hugetlb_cma is not used in this environment.  Strangely
enough, this does not reproduce (easily at least) if I use hugetlb_cma as
in the previous report.

The following are during a run after WARNING is triggered.

# cat /proc/zoneinfo
Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 11800
      nr_active_anon 109
      nr_inactive_file 38161
      nr_active_file 10007
      nr_unevictable 12
      nr_slab_reclaimable 2766
      nr_slab_unreclaimable 6881
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault_anon 0
      workingset_refault_file 0
      workingset_activate_anon 0
      workingset_activate_file 0
      workingset_restore_anon 0
      workingset_restore_file 0
      workingset_nodereclaim 0
      nr_anon_pages 11750
      nr_mapped    18402
      nr_file_pages 48339
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     166
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 6
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   14766
      nr_written   7701
      nr_throttled_written 0
      nr_kernel_misc_reclaimable 0
      nr_foll_pin_acquired 96
      nr_foll_pin_released 96
      nr_kernel_stack 1816
      nr_page_table_pages 1100
      nr_sec_page_table_pages 0
      nr_swapcached 0
  pages free     3840
        boost    0
        min      21
        low      26
        high     31
        spanned  4095
        present  3998
        managed  3840
        cma      0
        protection: (0, 1908, 7923, 7923)
      nr_free_pages 3840
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     0
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   0
      numa_other   0
  pagesets
    cpu: 0
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
    cpu: 1
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
    cpu: 2
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
    cpu: 3
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone    DMA32
  pages free     495317
        boost    0
        min      2687
        low      3358
        high     4029
        spanned  1044480
        present  520156
        managed  496486
        cma      0
        protection: (0, 0, 6015, 6015)
      nr_free_pages 495317
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     0
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   0
      numa_other   0
  pagesets
    cpu: 0
              count: 913
              high:  1679
              batch: 63
  vm stats threshold: 30
    cpu: 1
              count: 0
              high:  1679
              batch: 63
  vm stats threshold: 30
    cpu: 2
              count: 0
              high:  1679
              batch: 63
  vm stats threshold: 30
    cpu: 3
              count: 256
              high:  1679
              batch: 63
  vm stats threshold: 30
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone   Normal
  pages free     1360836
        boost    0
        min      8473
        low      10591
        high     12709
        spanned  1572864
        present  1572864
        managed  1552266
        cma      0
        protection: (0, 0, 0, 0)
      nr_free_pages 1360836
      nr_zone_inactive_anon 11800
      nr_zone_active_anon 109
      nr_zone_inactive_file 38161
      nr_zone_active_file 10007
      nr_zone_unevictable 12
      nr_zone_write_pending 0
      nr_mlock     12
      nr_bounce    0
      nr_zspages   3
      nr_free_cma  0
      numa_hit     10623572
      numa_miss    0
      numa_foreign 0
      numa_interleave 1357
      numa_local   6902986
      numa_other   3720586
  pagesets
    cpu: 0
              count: 156
              high:  5295
              batch: 63
  vm stats threshold: 42
    cpu: 1
              count: 210
              high:  5295
              batch: 63
  vm stats threshold: 42
    cpu: 2
              count: 4956
              high:  5295
              batch: 63
  vm stats threshold: 42
    cpu: 3
              count: 1
              high:  5295
              batch: 63
  vm stats threshold: 42
  node_unreclaimable:  0
  start_pfn:           1048576
Node 0, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)
Node 1, zone      DMA
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)
Node 1, zone    DMA32
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)
Node 1, zone   Normal
  per-node stats
      nr_inactive_anon 15381
      nr_active_anon 81
      nr_inactive_file 66550
      nr_active_file 25965
      nr_unevictable 421
      nr_slab_reclaimable 4069
      nr_slab_unreclaimable 7836
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault_anon 0
      workingset_refault_file 0
      workingset_activate_anon 0
      workingset_activate_file 0
      workingset_restore_anon 0
      workingset_restore_file 0
      workingset_nodereclaim 0
      nr_anon_pages 15420
      nr_mapped    24331
      nr_file_pages 92978
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     100
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 11
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   6217
      nr_written   2902
      nr_throttled_written 0
      nr_kernel_misc_reclaimable 0
      nr_foll_pin_acquired 0
      nr_foll_pin_released 0
      nr_kernel_stack 1656
      nr_page_table_pages 756
      nr_sec_page_table_pages 0
      nr_swapcached 0
  pages free     1829073
        boost    0
        min      11345
        low      14181
        high     17017
        spanned  2097152
        present  2097152
        managed  2086594
        cma      0
        protection: (0, 0, 0, 0)
      nr_free_pages 1829073
      nr_zone_inactive_anon 15381
      nr_zone_active_anon 81
      nr_zone_inactive_file 66550
      nr_zone_active_file 25965
      nr_zone_unevictable 421
      nr_zone_write_pending 0
      nr_mlock     421
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     10522401
      numa_miss    0
      numa_foreign 0
      numa_interleave 961
      numa_local   4057399
      numa_other   6465002
  pagesets
    cpu: 0
              count: 0
              high:  7090
              batch: 63
  vm stats threshold: 42
    cpu: 1
              count: 17
              high:  7090
              batch: 63
  vm stats threshold: 42
    cpu: 2
              count: 6997
              high:  7090
              batch: 63
  vm stats threshold: 42
    cpu: 3
              count: 0
              high:  7090
              batch: 63
  vm stats threshold: 42
  node_unreclaimable:  0
  start_pfn:           2621440
Node 1, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)

# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      1      0      0 
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3 
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Movable      1      0      1      2      2      3      3      3      4      4    480 
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable    566     14     22      7      8      8      9      4      7      0      1 
Node    0, zone   Normal, type      Movable    214    299    120     53     15     10      6      6      1      4   1159 
Node    0, zone   Normal, type  Reclaimable      0      9     18     11      6      1      0      0      0      0      0 
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate 
Node 0, zone      DMA            1            7            0            0            0            0 
Node 0, zone    DMA32            0         1016            0            0            0            0 
Node 0, zone   Normal           71         2995            6            0            0            0 
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    1, zone   Normal, type    Unmovable    459     12      5      6      6      5      5      5      6      2      1 
Node    1, zone   Normal, type      Movable   1287    502    171     85     34     14     13      8      2      5   1861 
Node    1, zone   Normal, type  Reclaimable      1      5     12      6      9      3      1      1      0      1      0 
Node    1, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    1, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    1, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      3 

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate 
Node 1, zone   Normal          101         3977           10            0            0            8 

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-18 17:40           ` Mike Kravetz
@ 2023-09-19  6:49             ` Johannes Weiner
  2023-09-19 12:37               ` Zi Yan
  2023-09-19 18:47               ` Mike Kravetz
  0 siblings, 2 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-09-19  6:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Vlastimil Babka, Andrew Morton, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> On 09/18/23 10:52, Johannes Weiner wrote:
> > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > 
> > > > With the patch below applied, a slightly different workload triggers the
> > > > following warnings.  It seems related, and appears to go away when
> > > > reverting the series.
> > > > 
> > > > [  331.595382] ------------[ cut here ]------------
> > > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > 
> > > Initially I thought this demonstrates the possible race I was suggesting in
> > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > are to stay, they need to handle this case. Maybe the same can happen with
> > > HIGHATOMIC blocks?

Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
show any CMA pages.

5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
and HIGHATOMIC.

> > This means we have an order-10 page where one half is MOVABLE and the
> > other is CMA.

This means the scenario is different:

We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
that the first pageblock is indeed MOVABLE. During the expand, the
second pageblock turns out to be of type MIGRATE_ISOLATE.

The page allocator wouldn't have merged those types. It triggers a bit
too fast to be a race condition.

It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
while the head is on the list, and then stranded there.

Could this be an issue in the page_isolation code? Maybe a range
rounding error?

Zi Yan, does this ring a bell for you?

I don't quite see how my patches could have caused this. But AFAICS we
also didn't have warnings for this scenario so it could be an old bug.

> > Mike, could you describe the workload that is triggering this?
> 
> This 'slightly different workload' is actually a slightly different
> environment.  Sorry for mis-speaking!  The slight difference is that this
> environment does not use the 'alloc hugetlb gigantic pages from CMA'
> (hugetlb_cma) feature that triggered the previous issue.
> 
> This is still on a 16G VM.  Kernel command line here is:
> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> hugetlb_free_vmemmap=on"
> 
> The workload is just running this script:
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
> 
> > 
> > Does this reproduce instantly and reliably?
> > 
> 
> It is not 'instant' but will reproduce fairly reliably within a minute
> or so.
> 
> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
> will eventually be freed via __free_pages(folio, 9).

No luck reproducing this yet, but I have a question. In that crash
stack trace, the expand() is called via this:

 [  331.645847]  get_page_from_freelist+0x3ed/0x1040
 [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
 [  331.647977]  __alloc_pages+0xec/0x240
 [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
 [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
 [  331.650938]  alloc_pool_huge_folio+0xad/0x110
 [  331.651909]  set_max_huge_pages+0x17d/0x390

I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
alloc_fresh_hugetlb_folio(), which has this:

        if (hstate_is_gigantic(h))
                folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
        else
                folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
                                nid, nmask, node_alloc_noretry);

where gigantic is defined as the order exceeding MAX_ORDER, which
should be the case for 1G pages on x86.

So the crashing stack must be from a 2M allocation, no? I'm confused
how that could happen with the above test case.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-19  6:49             ` Johannes Weiner
@ 2023-09-19 12:37               ` Zi Yan
  2023-09-19 15:22                 ` Zi Yan
  2023-09-19 18:47               ` Mike Kravetz
  1 sibling, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-19 12:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mike Kravetz, Vlastimil Babka, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


On 19 Sep 2023, at 2:49, Johannes Weiner wrote:

> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>> On 09/18/23 10:52, Johannes Weiner wrote:
>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>
>>>>> With the patch below applied, a slightly different workload triggers the
>>>>> following warnings.  It seems related, and appears to go away when
>>>>> reverting the series.
>>>>>
>>>>> [  331.595382] ------------[ cut here ]------------
>>>>> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>
>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>> HIGHATOMIC blocks?
>
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
>
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
>
>>> This means we have an order-10 page where one half is MOVABLE and the
>>> other is CMA.
>
> This means the scenario is different:
>
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
>
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
>
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
>
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
>
> Zi Yan, does this ring a bell for you?

Since isolation code works on pageblocks, a scenario I can think of
is that alloc_contig_range() is given a range starting from that tail
pageblock.

Hmm, I also notice that move_freepages_block(), called by
set_migratetype_isolate(), might change the isolation range as a result
of your change. I wonder if reverting that behavior would fix the issue.
Basically, do

	if (!zone_spans_pfn(zone, start))
		start = pfn;

in prep_move_freepages_block(). Just a wild guess. Mike, do you mind
giving it a try?

Meanwhile, let me try to reproduce it and look into it deeper.

>
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
>
>>> Mike, could you describe the workload that is triggering this?
>>
>> This 'slightly different workload' is actually a slightly different
>> environment.  Sorry for mis-speaking!  The slight difference is that this
>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>> (hugetlb_cma) feature that triggered the previous issue.
>>
>> This is still on a 16G VM.  Kernel command line here is:
>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>> hugetlb_free_vmemmap=on"
>>
>> The workload is just running this script:
>> while true; do
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> done
>>
>>>
>>> Does this reproduce instantly and reliably?
>>>
>>
>> It is not 'instant' but will reproduce fairly reliably within a minute
>> or so.
>>
>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
>> will eventually be freed via __free_pages(folio, 9).
>
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
>
>  [  331.645847]  get_page_from_freelist+0x3ed/0x1040
>  [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
>  [  331.647977]  __alloc_pages+0xec/0x240
>  [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
>  [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
>  [  331.650938]  alloc_pool_huge_folio+0xad/0x110
>  [  331.651909]  set_max_huge_pages+0x17d/0x390
>
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
>
>         if (hstate_is_gigantic(h))
>                 folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
>         else
>                 folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
>                                 nid, nmask, node_alloc_noretry);
>
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
>
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.

That matches my thinking too. Why would the crash happen during 1GB page
allocation, then? The range should be 1GB-aligned and therefore cannot
start in the middle of a MAX_ORDER free page block.


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-19 12:37               ` Zi Yan
@ 2023-09-19 15:22                 ` Zi Yan
  0 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-09-19 15:22 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


On 19 Sep 2023, at 8:37, Zi Yan wrote:

> On 19 Sep 2023, at 2:49, Johannes Weiner wrote:
>
>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>>
>>>>>> With the patch below applied, a slightly different workload triggers the
>>>>>> following warnings.  It seems related, and appears to go away when
>>>>>> reverting the series.
>>>>>>
>>>>>> [  331.595382] ------------[ cut here ]------------
>>>>>> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>>> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>>
>>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>>> HIGHATOMIC blocks?
>>
>> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
>> show any CMA pages.
>>
>> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
>> and HIGHATOMIC.
>>
>>>> This means we have an order-10 page where one half is MOVABLE and the
>>>> other is CMA.
>>
>> This means the scenario is different:
>>
>> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
>> that the first pageblock is indeed MOVABLE. During the expand, the
>> second pageblock turns out to be of type MIGRATE_ISOLATE.
>>
>> The page allocator wouldn't have merged those types. It triggers a bit
>> too fast to be a race condition.
>>
>> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
>> while the head is on the list, and then stranded there.
>>
>> Could this be an issue in the page_isolation code? Maybe a range
>> rounding error?
>>
>> Zi Yan, does this ring a bell for you?
>
> Since isolation code works on pageblocks, a scenario I can think of
> is that alloc_contig_range() is given a range starting from that tail
> pageblock.
>
> Hmm, I also notice that move_freepages_block() called by
> set_migratetype_isolate() might change isolation range by your change.
> I wonder if reverting that behavior would fix the issue. Basically,
> do
>
> 	if (!zone_spans_pfn(zone, start))
> 		start = pfn;
>
> in prep_move_freepages_block(). Just a wild guess. Mike, do you mind
> giving it a try?
>
> Meanwhile, let me try to reproduce it and look into it deeper.
>
>>
>> I don't quite see how my patches could have caused this. But AFAICS we
>> also didn't have warnings for this scenario so it could be an old bug.
>>
>>>> Mike, could you describe the workload that is triggering this?
>>>
>>> This 'slightly different workload' is actually a slightly different
>>> environment.  Sorry for mis-speaking!  The slight difference is that this
>>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>>> (hugetlb_cma) feature that triggered the previous issue.
>>>
>>> This is still on a 16G VM.  Kernel command line here is:
>>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>>> hugetlb_free_vmemmap=on"
>>>
>>> The workload is just running this script:
>>> while true; do
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>>>
>>>> Does this reproduce instantly and reliably?
>>>>
>>>
>>> It is not 'instant' but will reproduce fairly reliably within a minute
>>> or so.
>>>
>>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>>> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
>>> will eventually be freed via __free_pages(folio, 9).
>>
>> No luck reproducing this yet, but I have a question. In that crash
>> stack trace, the expand() is called via this:

I cannot reproduce it locally either. Do you mind sharing your config file?

Thanks.

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-19  6:49             ` Johannes Weiner
  2023-09-19 12:37               ` Zi Yan
@ 2023-09-19 18:47               ` Mike Kravetz
  2023-09-19 20:57                 ` Zi Yan
  1 sibling, 1 reply; 83+ messages in thread
From: Mike Kravetz @ 2023-09-19 18:47 UTC (permalink / raw)
  To: Johannes Weiner, Zi Yan
  Cc: Vlastimil Babka, Andrew Morton, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel


On 09/19/23 02:49, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> > On 09/18/23 10:52, Johannes Weiner wrote:
> > > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > > 
> > > > > With the patch below applied, a slightly different workload triggers the
> > > > > following warnings.  It seems related, and appears to go away when
> > > > > reverting the series.
> > > > > 
> > > > > [  331.595382] ------------[ cut here ]------------
> > > > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > > 
> > > > Initially I thought this demonstrates the possible race I was suggesting in
> > > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > > are to stay, they need to handle this case. Maybe the same can happen with
> > > > HIGHATOMIC blocks?
> 
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
> 
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
> 
> > > This means we have an order-10 page where one half is MOVABLE and the
> > > other is CMA.
> 
> This means the scenario is different:
> 
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
> 
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
> 
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
> 
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
> 
> Zi Yan, does this ring a bell for you?
> 
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
> 
> > > Mike, could you describe the workload that is triggering this?
> > 
> > This 'slightly different workload' is actually a slightly different
> > environment.  Sorry for mis-speaking!  The slight difference is that this
> > environment does not use the 'alloc hugetlb gigantic pages from CMA'
> > (hugetlb_cma) feature that triggered the previous issue.
> > 
> > This is still on a 16G VM.  Kernel command line here is:
> > "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> > root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> > console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> > hugetlb_free_vmemmap=on"
> > 
> > The workload is just running this script:
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> > 
> > > 
> > > Does this reproduce instantly and reliably?
> > > 
> > 
> > It is not 'instant' but will reproduce fairly reliably within a minute
> > or so.
> > 
> > Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> > to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
> > will eventually be freed via __free_pages(folio, 9).
> 
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
> 
>  [  331.645847]  get_page_from_freelist+0x3ed/0x1040
>  [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
>  [  331.647977]  __alloc_pages+0xec/0x240
>  [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
>  [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
>  [  331.650938]  alloc_pool_huge_folio+0xad/0x110
>  [  331.651909]  set_max_huge_pages+0x17d/0x390
> 
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
> 
>         if (hstate_is_gigantic(h))
>                 folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
>         else
>                 folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
>                                 nid, nmask, node_alloc_noretry);
> 
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
> 
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.

Sorry for causing the confusion!

When I originally saw the warnings pop up, I was running the above script
as well as another that only allocated order 9 hugetlb pages:

while true; do
	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

The warnings were actually triggered by allocations in this second script.

However, when reporting the warnings I wanted to include the simplest
way to recreate them.  And, I noticed that the second script running in
parallel was not required.  Again, sorry for the confusion!  Here is a
warning triggered via the alloc_contig_range path only running the one
script.

[  107.275821] ------------[ cut here ]------------
[  107.277001] page type is 0, passed migratetype is 1 (nr=512)
[  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
[  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
[  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
[  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
[  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
[  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
[  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
[  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
[  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
[  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  107.321575] Call Trace:
[  107.322314]  <TASK>
[  107.323002]  ? del_page_from_free_list+0x137/0x170
[  107.324380]  ? __warn+0x7d/0x130
[  107.325341]  ? del_page_from_free_list+0x137/0x170
[  107.326627]  ? report_bug+0x18d/0x1c0
[  107.327632]  ? prb_read_valid+0x17/0x20
[  107.328711]  ? handle_bug+0x41/0x70
[  107.329685]  ? exc_invalid_op+0x13/0x60
[  107.330787]  ? asm_exc_invalid_op+0x16/0x20
[  107.331937]  ? del_page_from_free_list+0x137/0x170
[  107.333189]  __free_one_page+0x2ab/0x6f0
[  107.334375]  free_pcppages_bulk+0x169/0x210
[  107.335575]  drain_pages_zone+0x3f/0x50
[  107.336691]  __drain_all_pages+0xe2/0x1e0
[  107.337843]  alloc_contig_range+0x143/0x280
[  107.339026]  alloc_contig_pages+0x210/0x270
[  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
[  107.341529]  alloc_pool_huge_page+0x7d/0x100
[  107.342745]  set_max_huge_pages+0x162/0x340
[  107.345059]  nr_hugepages_store_common+0x91/0xf0
[  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
[  107.347547]  vfs_write+0x207/0x400
[  107.348543]  ksys_write+0x63/0xe0
[  107.349511]  do_syscall_64+0x37/0x90
[  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  107.351940] RIP: 0033:0x7fabb8daee87
[  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
[  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
[  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
[  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
[  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
[  107.365968]  </TASK>
[  107.366534] ---[ end trace 0000000000000000 ]---
[  121.542474] ------------[ cut here ]------------

Perhaps another useful piece of information is that the warning can be
triggered via both allocation paths.
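For reference, a quick order calculation (a sketch assuming 4K base
pages and the post-6.4 MAX_ORDER of 10) shows why the two requests take
different paths: 2M pages are order 9 and fit the buddy allocator,
while 1G pages are order 18 and must go through alloc_contig_range:

```shell
# Compute the buddy-allocator order for a given allocation size,
# assuming 4K base pages.  With MAX_ORDER = 10 (v6.6), order 9 fits
# the buddy path while order 18 is "gigantic" and needs
# alloc_contig_range().
order_of() {
    local pages=$(( $1 / 4096 )) order=0
    while [ "$pages" -gt 1 ]; do
        pages=$(( pages / 2 ))
        order=$(( order + 1 ))
    done
    echo "$order"
}

order_of $(( 2 * 1024 * 1024 ))       # 2M hugetlb page  -> 9
order_of $(( 1024 * 1024 * 1024 ))    # 1G hugetlb page  -> 18
```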

To be perfectly clear, here is what I did today:
- built next-20230919.  It does not contain your series
  	I could not recreate the issue.
- Added your series and the patch to remove
  VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
	I could recreate the issue while running only the one script.
	The warning above is from that run.
- Added this suggested patch from Zi
	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
	index 1400e674ab86..77a4aea31a7f 100644
	--- a/mm/page_alloc.c
	+++ b/mm/page_alloc.c
	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 		end = pageblock_end_pfn(pfn) - 1;
 
 		/* Do not cross zone boundaries */
	+#if 0
 		if (!zone_spans_pfn(zone, start))
			start = zone->zone_start_pfn;
	+#else
	+	if (!zone_spans_pfn(zone, start))
	+		start = pfn;
	+#endif
	 	if (!zone_spans_pfn(zone, end))
	 		return false;
	I can still trigger warnings.

One idea about recreating the issue is that it may have to do with the
size of my VM (16G) and the requested allocation size (4G).  However, I
tried to really stress the allocations by increasing the number of
hugetlb pages requested, and that did not help.  I also noticed that I
only seem to get two warnings and then they stop, even if I continue to
run the script.
 
Zi asked about my config, so it is attached.
-- 
Mike Kravetz

[-- Attachment #2: mike.config --]
[-- Type: text/plain, Size: 158653 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 6.6.0-rc2 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=120301
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=23800
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=23800
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_TOOLS_SUPPORT_RELR=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=125
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
# CONFIG_BPF_SYSCALL is not set
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_DEFAULT_ON=y
# end of BPF subsystem

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_SCHED_CORE=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_HAVE_SCHED_AVG_IRQ=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem

CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
# CONFIG_PRINTK_INDEX is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC11_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
# CONFIG_CGROUP_FAVOR_DYNMODS is not set
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_SCHED_MM_CID=y
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_MISC is not set
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_SCHED_AUTOGROUP=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
# CONFIG_BOOT_CONFIG is not set
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_LD_ORPHAN_WARN=y
CONFIG_LD_ORPHAN_WARN_LEVEL="warn"
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_SELFTEST is not set
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_RSEQ=y
CONFIG_CACHESTAT_SYSCALL=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_GUEST_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y

#
# Kexec and crash features
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
# CONFIG_KEXEC_SIG is not set
CONFIG_KEXEC_JUMP=y
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
CONFIG_CRASH_MAX_MEMORY_RANGES=8192
# end of Kexec and crash features
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
# CONFIG_X86_CPU_RESCTRL is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
CONFIG_X86_INTEL_LPSS=y
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
CONFIG_IOSF_MBI=y
# CONFIG_IOSF_MBI_DEBUG is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_XXL=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_X86_HV_CALLBACK_VECTOR=y
CONFIG_XEN=y
CONFIG_XEN_PV=y
CONFIG_XEN_512GB=y
CONFIG_XEN_PV_SMP=y
CONFIG_XEN_PV_DOM0=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_PVHVM_SMP=y
CONFIG_XEN_PVHVM_GUEST=y
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
# CONFIG_XEN_PVH is not set
CONFIG_XEN_DOM0=y
CONFIG_XEN_PV_MSR_SAFE=y
CONFIG_KVM_GUEST=y
CONFIG_ARCH_CPUIDLE_HALTPOLL=y
# CONFIG_PVH is not set
CONFIG_PARAVIRT_TIME_ACCOUNTING=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_JAILHOUSE_GUEST is not set
# CONFIG_ACRN_GUEST is not set
# CONFIG_INTEL_TDX_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_GART_IOMMU is not set
CONFIG_BOOT_VESA_SUPPORT=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=8
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_PERF_EVENTS_AMD_UNCORE=y
# CONFIG_PERF_EVENTS_AMD_BRS is not set
# end of Performance monitoring

CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
CONFIG_MICROCODE=y
# CONFIG_MICROCODE_LATE_LOADING is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_X86_5LEVEL is not set
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
# CONFIG_AMD_MEM_ENCRYPT is not set
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
# CONFIG_ARCH_MEMORY_PROBE is not set
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
# CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK is not set
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_X86_UMIP=y
CONFIG_CC_HAS_IBT=y
# CONFIG_X86_KERNEL_IBT is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_X86_INTEL_TSX_MODE_OFF=y
# CONFIG_X86_INTEL_TSX_MODE_ON is not set
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
# CONFIG_X86_SGX is not set
# CONFIG_X86_USER_SHADOW_STACK is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_HANDOVER_PROTOCOL=y
# CONFIG_EFI_MIXED is not set
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_ARCH_SUPPORTS_KEXEC=y
CONFIG_ARCH_SUPPORTS_KEXEC_FILE=y
CONFIG_ARCH_SELECTS_KEXEC_FILE=y
CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG_FORCE=y
CONFIG_ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_JUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_DUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_HOTPLUG=y
CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
# CONFIG_ADDRESS_MASKING is not set
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
CONFIG_LEGACY_VSYSCALL_XONLY=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_MODIFY_LDT_SYSCALL=y
# CONFIG_STRICT_SIGALTSTACK_SIZE is not set
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
# end of Processor type and features

CONFIG_CC_HAS_SLS=y
CONFIG_CC_HAS_RETURN_THUNK=y
CONFIG_CC_HAS_ENTRY_PADDING=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FUNCTION_PADDING_BYTES=16
CONFIG_CALL_PADDING=y
CONFIG_HAVE_CALL_THUNKS=y
CONFIG_CALL_THUNKS=y
CONFIG_PREFIX_SYMBOLS=y
CONFIG_SPECULATION_MITIGATIONS=y
CONFIG_PAGE_TABLE_ISOLATION=y
CONFIG_RETPOLINE=y
CONFIG_RETHUNK=y
CONFIG_CPU_UNRET_ENTRY=y
CONFIG_CALL_DEPTH_TRACKING=y
# CONFIG_CALL_THUNKS_DEBUG is not set
CONFIG_CPU_IBPB_ENTRY=y
CONFIG_CPU_IBRS_ENTRY=y
CONFIG_CPU_SRSO=y
# CONFIG_SLS is not set
# CONFIG_GDS_FORCE_MITIGATION is not set
CONFIG_ARCH_HAS_ADD_PAGES=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_HIBERNATION_SNAPSHOT_DEV=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_USERSPACE_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_TAD is not set
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
CONFIG_ACPI_BGRT=y
# CONFIG_ACPI_NFIT is not set
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_HMAT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_EXTLOG is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_ACPI_PFRUT is not set
CONFIG_ACPI_PCC=y
# CONFIG_ACPI_FFH is not set
# CONFIG_PMIC_OPREGION is not set
CONFIG_ACPI_PRMT=y
CONFIG_X86_PM_TIMER=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_AMD_PSTATE is not set
# CONFIG_X86_AMD_PSTATE_UT is not set
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
CONFIG_CPU_IDLE_GOV_HALTPOLL=y
CONFIG_HALTPOLL_CPUIDLE=y
# end of CPU Idle

CONFIG_INTEL_IDLE=y
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_MMCONF_FAM10H=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_EMULATION_DEFAULT_DISABLED is not set
# CONFIG_X86_X32_ABI is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
# end of Binary Emulations

CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
# CONFIG_KVM is not set
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y
CONFIG_AS_GFNI=y
CONFIG_AS_WRUSS=y

#
# General architecture-dependent options
#
CONFIG_HOTPLUG_SMT=y
CONFIG_HOTPLUG_CORE_SYNC=y
CONFIG_HOTPLUG_CORE_SYNC_DEAD=y
CONFIG_HOTPLUG_CORE_SYNC_FULL=y
CONFIG_HOTPLUG_SPLIT_STARTUP=y
CONFIG_HOTPLUG_PARALLEL=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_KRETPROBE_ON_RETHOOK=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_ARCH_HAS_CPU_FINALIZE_INIT=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_RUST=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_TABLE_FREE=y
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_LTO_NONE=y
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING_USER=y
CONFIG_HAVE_CONTEXT_TRACKING_USER_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_PMD_MKWRITE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_OBJTOOL=y
CONFIG_HAVE_JUMP_LABEL_HACK=y
CONFIG_HAVE_NOINSTR_HACK=y
CONFIG_HAVE_NOINSTR_VALIDATION=y
CONFIG_HAVE_UACCESS_VALIDATION=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_FUNCTION_ALIGNMENT_4B=y
CONFIG_FUNCTION_ALIGNMENT_16B=y
CONFIG_FUNCTION_ALIGNMENT=16
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_DEBUG is not set
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODULE_UNLOAD_TAINT_TRACKING is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODPROBE_PATH="/sbin/modprobe"
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
CONFIG_BLK_CGROUP_RWSTAT=y
CONFIG_BLK_CGROUP_PUNT_BIO=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_INTEGRITY_T10=y
# CONFIG_BLK_DEV_ZONED is not set
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_DEV_THROTTLING_LOW is not set
# CONFIG_BLK_WBT is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
# CONFIG_BLK_CGROUP_IOPRIO is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_AIX_PARTITION=y
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
# end of Partition Types

CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y
CONFIG_BLK_MQ_STACKING=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
# end of IO Schedulers

CONFIG_ASN1=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
# CONFIG_BINFMT_MISC is not set
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_ZPOOL=y
CONFIG_SWAP=y
CONFIG_ZSWAP=y
# CONFIG_ZSWAP_DEFAULT_ON is not set
# CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="lzo"
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD=y
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT="zbud"
CONFIG_ZBUD=y
# CONFIG_Z3FOLD is not set
CONFIG_ZSMALLOC=y
# CONFIG_ZSMALLOC_STAT is not set
CONFIG_ZSMALLOC_CHAIN_SIZE=8

#
# SLAB allocator options
#
# CONFIG_SLAB_DEPRECATED is not set
CONFIG_SLUB=y
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
# CONFIG_SLUB_STATS is not set
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_RANDOM_KMALLOC_CACHES is not set
# end of SLAB allocator options

# CONFIG_SHUFFLE_PAGE_ALLOCATOR is not set
# CONFIG_COMPAT_BRK is not set
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_EXCLUSIVE_SYSTEM_RAM=y
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_MEMORY_BALLOON=y
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
CONFIG_PAGE_REPORTING=y
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
# CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
CONFIG_THP_SWAP=y
# CONFIG_READ_ONLY_THP_FOR_FS is not set
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_CMA=y
# CONFIG_CMA_DEBUG is not set
CONFIG_CMA_DEBUGFS=y
# CONFIG_CMA_SYSFS is not set
CONFIG_CMA_AREAS=19
CONFIG_GENERIC_EARLY_IOREMAP=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ZONE_DMA=y
CONFIG_ZONE_DMA32=y
# CONFIG_ZONE_DEVICE is not set
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_TEST is not set
# CONFIG_DMAPOOL_TEST is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_MEMFD_CREATE=y
CONFIG_SECRETMEM=y
# CONFIG_ANON_VMA_NAME is not set
CONFIG_USERFAULTFD=y
CONFIG_HAVE_ARCH_USERFAULTFD_WP=y
CONFIG_HAVE_ARCH_USERFAULTFD_MINOR=y
CONFIG_PTE_MARKER_UFFD_WP=y
# CONFIG_LRU_GEN is not set
CONFIG_ARCH_SUPPORTS_PER_VMA_LOCK=y
CONFIG_PER_VMA_LOCK=y
CONFIG_LOCK_MM_AND_FIND_VMA=y
CONFIG_EXECMEM=y

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_NET_EGRESS=y
CONFIG_NET_XGRESS=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_USER_COMPAT is not set
# CONFIG_XFRM_INTERFACE is not set
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
# CONFIG_NET_KEY is not set
CONFIG_NET_HANDSHAKE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_IP_MROUTE_COMMON=y
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_NET_IPVTI is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_TABLE_PERTURB_ORDER=16
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_TCP_CONG_DCTCP is not set
# CONFIG_TCP_CONG_CDG is not set
# CONFIG_TCP_CONG_BBR is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
CONFIG_IPV6_MIP6=y
# CONFIG_IPV6_ILA is not set
# CONFIG_IPV6_VTI is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
# CONFIG_IPV6_RPL_LWTUNNEL is not set
# CONFIG_IPV6_IOAM6_LWTUNNEL is not set
CONFIG_NETLABEL=y
# CONFIG_MPTCP is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
CONFIG_NETWORK_PHY_TIMESTAMPING=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
# CONFIG_BRIDGE_NETFILTER is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_EGRESS=y
CONFIG_NETFILTER_SKIP_EGRESS=y
CONFIG_NETFILTER_FAMILY_BRIDGE=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NETFILTER_NETLINK_OSF is not set
CONFIG_NF_CONNTRACK=m
# CONFIG_NF_LOG_SYSLOG is not set
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
# CONFIG_NF_CONNTRACK_ZONES is not set
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
CONFIG_NF_CONNTRACK_TIMESTAMP=y
# CONFIG_NF_CONNTRACK_LABELS is not set
# CONFIG_NF_CT_PROTO_DCCP is not set
# CONFIG_NF_CT_PROTO_SCTP is not set
# CONFIG_NF_CT_PROTO_UDPLITE is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_SNMP is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NF_NAT=m
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XTABLES_COMPAT=y

#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set
# CONFIG_NETFILTER_XT_CONNMARK is not set

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CHECKSUM is not set
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
# CONFIG_NETFILTER_XT_TARGET_CONNSECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_CT is not set
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
# CONFIG_NETFILTER_XT_TARGET_HL is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
CONFIG_NETFILTER_XT_NAT=m
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_NOTRACK is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
# CONFIG_NETFILTER_XT_TARGET_MASQUERADE is not set
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
# CONFIG_NETFILTER_XT_TARGET_TPROXY is not set
# CONFIG_NETFILTER_XT_TARGET_TRACE is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
# CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP is not set

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_CLUSTER is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLABEL is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_SOCKET is not set
# CONFIG_NETFILTER_XT_MATCH_STATE is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# end of Core Netfilter Configuration

# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_TPROXY_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
CONFIG_NF_REJECT_IPV4=y
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_RPFILTER is not set
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
# CONFIG_IP_NF_TARGET_SYNPROXY is not set
CONFIG_IP_NF_NAT=m
# CONFIG_IP_NF_TARGET_MASQUERADE is not set
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
CONFIG_IP_NF_MANGLE=m
# CONFIG_IP_NF_TARGET_ECN is not set
# CONFIG_IP_NF_TARGET_TTL is not set
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_SECURITY=m
# CONFIG_IP_NF_ARPTABLES is not set
# end of IP: Netfilter Configuration

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_TPROXY_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
CONFIG_NF_REJECT_IPV6=m
# CONFIG_NF_LOG_IPV6 is not set
CONFIG_IP6_NF_IPTABLES=m
# CONFIG_IP6_NF_MATCH_AH is not set
# CONFIG_IP6_NF_MATCH_EUI64 is not set
# CONFIG_IP6_NF_MATCH_FRAG is not set
# CONFIG_IP6_NF_MATCH_OPTS is not set
# CONFIG_IP6_NF_MATCH_HL is not set
# CONFIG_IP6_NF_MATCH_IPV6HEADER is not set
# CONFIG_IP6_NF_MATCH_MH is not set
CONFIG_IP6_NF_MATCH_RPFILTER=m
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_MATCH_SRH is not set
# CONFIG_IP6_NF_TARGET_HL is not set
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
# CONFIG_IP6_NF_TARGET_SYNPROXY is not set
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_RAW=m
CONFIG_IP6_NF_SECURITY=m
CONFIG_IP6_NF_NAT=m
# CONFIG_IP6_NF_TARGET_MASQUERADE is not set
# CONFIG_IP6_NF_TARGET_NPT is not set
# end of IPv6: Netfilter Configuration

CONFIG_NF_DEFRAG_IPV6=m
# CONFIG_NF_CONNTRACK_BRIDGE is not set
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
# CONFIG_BRIDGE_EBT_802_3 is not set
# CONFIG_BRIDGE_EBT_AMONG is not set
# CONFIG_BRIDGE_EBT_ARP is not set
# CONFIG_BRIDGE_EBT_IP is not set
# CONFIG_BRIDGE_EBT_IP6 is not set
# CONFIG_BRIDGE_EBT_LIMIT is not set
# CONFIG_BRIDGE_EBT_MARK is not set
# CONFIG_BRIDGE_EBT_PKTTYPE is not set
# CONFIG_BRIDGE_EBT_STP is not set
# CONFIG_BRIDGE_EBT_VLAN is not set
# CONFIG_BRIDGE_EBT_ARPREPLY is not set
# CONFIG_BRIDGE_EBT_DNAT is not set
# CONFIG_BRIDGE_EBT_MARK_T is not set
# CONFIG_BRIDGE_EBT_REDIRECT is not set
# CONFIG_BRIDGE_EBT_SNAT is not set
# CONFIG_BRIDGE_EBT_LOG is not set
# CONFIG_BRIDGE_EBT_NFLOG is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_BRIDGE_MRP is not set
# CONFIG_BRIDGE_CFM is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_CBS is not set
# CONFIG_NET_SCH_ETF is not set
# CONFIG_NET_SCH_TAPRIO is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_SKBPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
CONFIG_NET_SCH_FQ_CODEL=y
# CONFIG_NET_SCH_CAKE is not set
# CONFIG_NET_SCH_FQ is not set
# CONFIG_NET_SCH_HHF is not set
# CONFIG_NET_SCH_PIE is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_SCH_ETS is not set
# CONFIG_NET_SCH_DEFAULT is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
# CONFIG_NET_EMATCH_IPT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_SAMPLE is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_MPLS is not set
# CONFIG_NET_ACT_VLAN is not set
# CONFIG_NET_ACT_BPF is not set
# CONFIG_NET_ACT_CONNMARK is not set
# CONFIG_NET_ACT_CTINFO is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
# CONFIG_NET_ACT_GATE is not set
# CONFIG_NET_TC_SKB_EXT is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
# CONFIG_DNS_RESOLVER is not set
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_MAX_SKB_FRAGS=17
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
CONFIG_NET_DROP_MONITOR=y
# end of Network testing
# end of Networking options

CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
CONFIG_BT=m
CONFIG_BT_BREDR=y
# CONFIG_BT_RFCOMM is not set
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
# CONFIG_BT_CMTP is not set
# CONFIG_BT_HIDP is not set
CONFIG_BT_HS=y
CONFIG_BT_LE=y
CONFIG_BT_LE_L2CAP_ECRED=y
# CONFIG_BT_LEDS is not set
# CONFIG_BT_MSFTEXT is not set
# CONFIG_BT_AOSPEXT is not set
CONFIG_BT_DEBUGFS=y
# CONFIG_BT_SELFTEST is not set

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBCM4377 is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_BT_MRVL is not set
# CONFIG_BT_VIRTIO is not set
# end of Bluetooth device drivers

# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_MCTP is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
CONFIG_RFKILL=m
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
# CONFIG_RFKILL_GPIO is not set
CONFIG_NET_9P=m
CONFIG_NET_9P_FD=m
CONFIG_NET_9P_VIRTIO=m
# CONFIG_NET_9P_XEN is not set
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_GRO_CELLS=y
CONFIG_NET_SELFTESTS=y
CONFIG_FAILOVER=m
CONFIG_ETHTOOL_NETLINK=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIE_ECRC=y
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
# CONFIG_PCI_PF_STUB is not set
# CONFIG_XEN_PCIDEV_FRONTEND is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# PCI controller drivers
#
# CONFIG_VMD is not set

#
# Cadence-based PCIe controllers
#
# end of Cadence-based PCIe controllers

#
# DesignWare-based PCIe controllers
#
# CONFIG_PCI_MESON is not set
# CONFIG_PCIE_DW_PLAT_HOST is not set
# end of DesignWare-based PCIe controllers

#
# Mobiveil-based PCIe controllers
#
# end of Mobiveil-based PCIe controllers
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
# CONFIG_YENTA is not set
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
# CONFIG_UEVENT_HELPER is not set
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
# CONFIG_DEVTMPFS_SAFE is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_DEBUG=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_FW_LOADER_USER_HELPER is not set
# CONFIG_FW_LOADER_COMPRESS is not set
CONFIG_FW_CACHE=y
# CONFIG_FW_UPLOAD is not set
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_SYS_HYPERVISOR=y
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_REGMAP=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# CONFIG_FW_DEVLINK_SYNC_STATE_TIMEOUT is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# CONFIG_MHI_BUS_EP is not set
# end of Bus devices

#
# Cache Drivers
#
# end of Cache Drivers

CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
CONFIG_DMI_SYSFS=y
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
CONFIG_SYSFB=y
# CONFIG_SYSFB_SIMPLEFB is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE=y
CONFIG_EFI_DXE_MEM_ATTRIBUTES=y
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
CONFIG_EFI_EARLYCON=y
CONFIG_EFI_CUSTOM_SSDT_OVERLAYS=y
# CONFIG_EFI_DISABLE_RUNTIME is not set
# CONFIG_EFI_COCO_SECRET is not set
# end of EFI (Extensible Firmware Interface) Support

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
# CONFIG_PARPORT_SERIAL is not set
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_PC_PCMCIA is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
CONFIG_CDROM=y
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
CONFIG_ZRAM=y
CONFIG_ZRAM_DEF_COMP_LZORLE=y
# CONFIG_ZRAM_DEF_COMP_LZO is not set
CONFIG_ZRAM_DEF_COMP="lzo-rle"
# CONFIG_ZRAM_WRITEBACK is not set
# CONFIG_ZRAM_MEMORY_TRACKING is not set
# CONFIG_ZRAM_MULTI_COMP is not set
# CONFIG_BLK_DEV_LOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_XEN_BLKDEV_FRONTEND is not set
# CONFIG_XEN_BLKDEV_BACKEND is not set
CONFIG_VIRTIO_BLK=m
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_UBLK is not set

#
# NVME Support
#
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TCP is not set
# CONFIG_NVME_TARGET is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# end of Texas Instruments shared transport line discipline

# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_BCM_VK is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
# CONFIG_UACCE is not set
# CONFIG_PVPANIC is not set
# CONFIG_GP_PCI1XXXX is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI_COMMON=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_CHR_DEV_SG=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT3SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_MPI3MR is not set
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_XEN_SCSI_FRONTEND is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_FDOMAIN_PCI is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_VIRTIO is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# end of SCSI device support

CONFIG_ATA=y
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_AHCI_DWC is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_DWC is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_PARPORT is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=m
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_BITMAP_FILE=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
CONFIG_DM_DEBUG=y
CONFIG_DM_BUFIO=y
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
# CONFIG_DM_UNSTRIPED is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
# CONFIG_DM_THIN_PROVISIONING is not set
# CONFIG_DM_CACHE is not set
# CONFIG_DM_WRITECACHE is not set
# CONFIG_DM_EBS is not set
# CONFIG_DM_ERA is not set
# CONFIG_DM_CLONE is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_DUST is not set
# CONFIG_DM_INIT is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_DM_INTEGRITY is not set
# CONFIG_DM_AUDIT is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_MII=m
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_IPVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
# CONFIG_AMT is not set
# CONFIG_MACSEC is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_TUN is not set
# CONFIG_TUN_VNET_CROSS_LE is not set
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=m
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_3C589 is not set
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_NET_VENDOR_AGERE is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_PCMCIA_NMCLAN is not set
# CONFIG_AMD_XGBE is not set
# CONFIG_PDS_CORE is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ASIX=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_CX_ECAT is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BCMGENET is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2X is not set
# CONFIG_SYSTEMPORT is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_CADENCE=y
# CONFIG_MACB is not set
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
CONFIG_NET_VENDOR_CORTINA=y
CONFIG_NET_VENDOR_DAVICOM=y
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_ENGLEDER=y
# CONFIG_TSNEP is not set
CONFIG_NET_VENDOR_EZCHIP=y
# CONFIG_NET_VENDOR_FUJITSU is not set
CONFIG_NET_VENDOR_FUNGIBLE=y
# CONFIG_FUN_ETH is not set
CONFIG_NET_VENDOR_GOOGLE=y
# CONFIG_GVE is not set
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
# CONFIG_NET_VENDOR_I825XX is not set
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
# CONFIG_ICE is not set
# CONFIG_FM10K is not set
# CONFIG_IGC is not set
# CONFIG_IDPF is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_OCTEON_EP is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_LAN743X is not set
# CONFIG_VCAP is not set
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MICROSOFT=y
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NI=y
# CONFIG_NI_XGE_MANAGEMENT_ENET is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_NETERION=y
# CONFIG_S2IO is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_PCMCIA_PCNET is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_VENDOR_PACKET_ENGINES=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_PENSANDO=y
# CONFIG_IONIC is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
# CONFIG_NET_VENDOR_QUALCOMM is not set
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_ATP is not set
CONFIG_8139CP=m
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_ROCKER=y
# CONFIG_NET_VENDOR_SAMSUNG is not set
# CONFIG_NET_VENDOR_SEEQ is not set
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_PCMCIA_SMC91C92 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_SOCIONEXT=y
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_SYNOPSYS=y
# CONFIG_DWC_XLGMAC is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_PHY_SEL is not set
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VERTEXCOM=y
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WANGXUN=y
# CONFIG_NGBE is not set
# CONFIG_TXGBE is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_EMACLITE is not set
# CONFIG_XILINX_AXI_EMAC is not set
# CONFIG_XILINX_LL_TEMAC is not set
CONFIG_NET_VENDOR_XIRCOM=y
# CONFIG_PCMCIA_XIRC2PS is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
# CONFIG_LED_TRIGGER_PHY is not set
CONFIG_FIXED_PHY=y

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_ADIN_PHY is not set
# CONFIG_ADIN1100_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AX88796B_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_BCM54140_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM84881_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_CORTINA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MARVELL_88Q2XXX_PHY is not set
# CONFIG_MARVELL_88X2222_PHY is not set
# CONFIG_MAXLINEAR_GPHY is not set
# CONFIG_MEDIATEK_GE_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_T1S_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROCHIP_T1_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_MOTORCOMM_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_NXP_CBTX_PHY is not set
# CONFIG_NXP_C45_TJA11XX_PHY is not set
# CONFIG_NXP_TJA11XX_PHY is not set
# CONFIG_NCN26000_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_RENESAS_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_DP83822_PHY is not set
# CONFIG_DP83TC811_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_DP83869_PHY is not set
# CONFIG_DP83TD510_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_PSE_CONTROLLER is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_FWNODE_MDIO=y
CONFIG_ACPI_MDIO=y
CONFIG_MDIO_DEVRES=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_MVUSB is not set
# CONFIG_MDIO_THUNDER is not set

#
# MDIO Multiplexers
#

#
# PCS device drivers
#
# end of PCS device drivers

# CONFIG_PLIP is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
CONFIG_WLAN_VENDOR_ADMTEK=y
CONFIG_WLAN_VENDOR_ATH=y
# CONFIG_ATH_DEBUG is not set
# CONFIG_ATH5K_PCI is not set
CONFIG_WLAN_VENDOR_ATMEL=y
CONFIG_WLAN_VENDOR_BROADCOM=y
CONFIG_WLAN_VENDOR_CISCO=y
CONFIG_WLAN_VENDOR_INTEL=y
CONFIG_WLAN_VENDOR_INTERSIL=y
# CONFIG_HOSTAP is not set
CONFIG_WLAN_VENDOR_MARVELL=y
CONFIG_WLAN_VENDOR_MEDIATEK=y
CONFIG_WLAN_VENDOR_MICROCHIP=y
CONFIG_WLAN_VENDOR_PURELIFI=y
CONFIG_WLAN_VENDOR_RALINK=y
CONFIG_WLAN_VENDOR_REALTEK=y
CONFIG_WLAN_VENDOR_RSI=y
CONFIG_WLAN_VENDOR_SILABS=y
CONFIG_WLAN_VENDOR_ST=y
CONFIG_WLAN_VENDOR_TI=y
CONFIG_WLAN_VENDOR_ZYDAS=y
CONFIG_WLAN_VENDOR_QUANTENNA=y
# CONFIG_PCMCIA_RAYCS is not set
# CONFIG_WAN is not set

#
# Wireless WAN
#
# CONFIG_WWAN is not set
# end of Wireless WAN

# CONFIG_XEN_NETDEV_FRONTEND is not set
# CONFIG_XEN_NETDEV_BACKEND is not set
# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_NETDEVSIM is not set
CONFIG_NET_FAILOVER=m
CONFIG_ISDN=y
CONFIG_ISDN_CAPI=y
# CONFIG_MISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_LEDS=y
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set
CONFIG_INPUT_VIVALDIFMAP=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1050 is not set
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_CYPRESS_SF is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_ELANTECH_SMBUS=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
# CONFIG_MOUSE_PS2_VMMOUSE is not set
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_DB9 is not set
# CONFIG_JOYSTICK_GAMECON is not set
# CONFIG_JOYSTICK_TURBOGRAFX is not set
# CONFIG_JOYSTICK_AS5011 is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_JOYSTICK_WALKERA0701 is not set
# CONFIG_JOYSTICK_PXRC is not set
# CONFIG_JOYSTICK_QWIIC is not set
# CONFIG_JOYSTICK_FSIA6B is not set
# CONFIG_JOYSTICK_SENSEHAT is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_PEGASUS is not set
# CONFIG_TABLET_SERIAL_WACOM4 is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_AUO_PIXCIR is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_BU21029 is not set
# CONFIG_TOUCHSCREEN_CHIPONE_ICN8505 is not set
# CONFIG_TOUCHSCREEN_CY8CTMA140 is not set
# CONFIG_TOUCHSCREEN_CY8CTMG110 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP5 is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_EGALAX_SERIAL is not set
# CONFIG_TOUCHSCREEN_EXC3000 is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_GOODIX is not set
# CONFIG_TOUCHSCREEN_HIDEEP is not set
# CONFIG_TOUCHSCREEN_HYCON_HY46XX is not set
# CONFIG_TOUCHSCREEN_HYNITRON_CSTXXX is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_ILITEK is not set
# CONFIG_TOUCHSCREEN_S6SY761 is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_EKTF2127 is not set
# CONFIG_TOUCHSCREEN_ELAN is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MELFAS_MIP4 is not set
# CONFIG_TOUCHSCREEN_MSG2638 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_NOVATEK_NVT_TS is not set
# CONFIG_TOUCHSCREEN_IMAGIS is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WDT87XX_I2C is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2004 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_RM_TS is not set
# CONFIG_TOUCHSCREEN_SILEAD is not set
# CONFIG_TOUCHSCREEN_SIS_I2C is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_STMFTS is not set
# CONFIG_TOUCHSCREEN_SX8654 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
# CONFIG_TOUCHSCREEN_ZET6223 is not set
# CONFIG_TOUCHSCREEN_ZFORCE is not set
# CONFIG_TOUCHSCREEN_ROHM_BU21023 is not set
# CONFIG_TOUCHSCREEN_IQS5XX is not set
# CONFIG_TOUCHSCREEN_IQS7211 is not set
# CONFIG_TOUCHSCREEN_ZINITIX is not set
# CONFIG_TOUCHSCREEN_HIMAX_HX83112B is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_GPIO_BEEPER is not set
# CONFIG_INPUT_GPIO_DECODER is not set
# CONFIG_INPUT_GPIO_VIBRA is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_DA7280_HAPTICS is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_IQS269A is not set
# CONFIG_INPUT_IQS626A is not set
# CONFIG_INPUT_IQS7222 is not set
# CONFIG_INPUT_CMA3000 is not set
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=y
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV260X_HAPTICS is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_SERIO_GPIO_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_LEGACY_TIOCSTI=y
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_DEPRECATED_OPTIONS is not set
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCILIB=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_PCI1XXXX=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
CONFIG_SERIAL_8250_PERICOM=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# CONFIG_SERIAL_SPRD is not set
# end of Serial drivers

CONFIG_SERIAL_MCTRL_GPIO=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_N_HDLC is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_N_GSM is not set
# CONFIG_NOZOMI is not set
# CONFIG_NULL_TTY is not set
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_HVC_XEN_FRONTEND=y
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_PRINTER is not set
CONFIG_PPDEV=m
CONFIG_VIRTIO_CONSOLE=m
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_BA431 is not set
# CONFIG_HW_RANDOM_VIA is not set
# CONFIG_HW_RANDOM_VIRTIO is not set
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
CONFIG_NVRAM=y
CONFIG_DEVPORT=y
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# CONFIG_XILLYUSB is not set
# end of Character devices

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_AMD_MP2 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
CONFIG_I2C_PIIX4=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_CBUS_GPIO is not set
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_CP2615 is not set
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PCI1XXXX is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_VIRTIO is not set
# end of I2C Hardware Bus support

# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# end of I2C support

# CONFIG_I3C is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
# CONFIG_PPS is not set

#
# PTP clock support
#
# CONFIG_PTP_1588_CLOCK is not set
CONFIG_PTP_1588_CLOCK_OPTIONAL=y
# end of PTP clock support

CONFIG_PINCTRL=y
CONFIG_PINMUX=y
CONFIG_PINCONF=y
CONFIG_GENERIC_PINCONF=y
# CONFIG_DEBUG_PINCTRL is not set
# CONFIG_PINCTRL_AMD is not set
# CONFIG_PINCTRL_CY8C95X0 is not set
# CONFIG_PINCTRL_MCP23S08 is not set
# CONFIG_PINCTRL_SX150X is not set

#
# Intel pinctrl drivers
#
CONFIG_PINCTRL_BAYTRAIL=y
# CONFIG_PINCTRL_CHERRYVIEW is not set
# CONFIG_PINCTRL_LYNXPOINT is not set
CONFIG_PINCTRL_INTEL=y
# CONFIG_PINCTRL_ALDERLAKE is not set
# CONFIG_PINCTRL_BROXTON is not set
# CONFIG_PINCTRL_CANNONLAKE is not set
# CONFIG_PINCTRL_CEDARFORK is not set
# CONFIG_PINCTRL_DENVERTON is not set
# CONFIG_PINCTRL_ELKHARTLAKE is not set
# CONFIG_PINCTRL_EMMITSBURG is not set
# CONFIG_PINCTRL_GEMINILAKE is not set
# CONFIG_PINCTRL_ICELAKE is not set
# CONFIG_PINCTRL_JASPERLAKE is not set
# CONFIG_PINCTRL_LAKEFIELD is not set
# CONFIG_PINCTRL_LEWISBURG is not set
# CONFIG_PINCTRL_METEORLAKE is not set
# CONFIG_PINCTRL_SUNRISEPOINT is not set
# CONFIG_PINCTRL_TIGERLAKE is not set
# end of Intel pinctrl drivers

#
# Renesas pinctrl drivers
#
# end of Renesas pinctrl drivers

CONFIG_GPIOLIB=y
CONFIG_GPIOLIB_FASTPATH_LIMIT=512
CONFIG_GPIO_ACPI=y
CONFIG_GPIOLIB_IRQCHIP=y
# CONFIG_DEBUG_GPIO is not set
CONFIG_GPIO_CDEV=y
CONFIG_GPIO_CDEV_V1=y

#
# Memory mapped GPIO drivers
#
# CONFIG_GPIO_AMDPT is not set
# CONFIG_GPIO_DWAPB is not set
# CONFIG_GPIO_EXAR is not set
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_MB86S7X is not set
# CONFIG_GPIO_AMD_FCH is not set
# end of Memory mapped GPIO drivers

#
# Port-mapped I/O GPIO drivers
#
# CONFIG_GPIO_VX855 is not set
# CONFIG_GPIO_F7188X is not set
# CONFIG_GPIO_IT87 is not set
# CONFIG_GPIO_SCH311X is not set
# CONFIG_GPIO_WINBOND is not set
# CONFIG_GPIO_WS16C48 is not set
# end of Port-mapped I/O GPIO drivers

#
# I2C GPIO expanders
#
# CONFIG_GPIO_FXL6408 is not set
# CONFIG_GPIO_DS4520 is not set
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCA9570 is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_TPIC2810 is not set
# end of I2C GPIO expanders

#
# MFD GPIO expanders
#
# CONFIG_GPIO_ELKHARTLAKE is not set
# end of MFD GPIO expanders

#
# PCI GPIO expanders
#
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_PCI_IDIO_16 is not set
# CONFIG_GPIO_PCIE_IDIO_24 is not set
# CONFIG_GPIO_RDC321X is not set
# end of PCI GPIO expanders

#
# USB GPIO expanders
#
# end of USB GPIO expanders

#
# Virtual GPIO drivers
#
# CONFIG_GPIO_AGGREGATOR is not set
# CONFIG_GPIO_LATCH is not set
# CONFIG_GPIO_MOCKUP is not set
# CONFIG_GPIO_VIRTIO is not set
# CONFIG_GPIO_SIM is not set
# end of Virtual GPIO drivers

# CONFIG_W1 is not set
CONFIG_POWER_RESET=y
# CONFIG_POWER_RESET_RESTART is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_IP5XXX_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_CHARGER_ADP5061 is not set
# CONFIG_BATTERY_CW2015 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SAMSUNG_SDI is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_LT3651 is not set
# CONFIG_CHARGER_LTC4162L is not set
# CONFIG_CHARGER_MAX77976 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_BQ24257 is not set
# CONFIG_CHARGER_BQ24735 is not set
# CONFIG_CHARGER_BQ2515X is not set
# CONFIG_CHARGER_BQ25890 is not set
# CONFIG_CHARGER_BQ25980 is not set
# CONFIG_CHARGER_BQ256XX is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
# CONFIG_BATTERY_GOLDFISH is not set
# CONFIG_BATTERY_RT5033 is not set
# CONFIG_CHARGER_RT9455 is not set
# CONFIG_CHARGER_BD99954 is not set
# CONFIG_BATTERY_UG3105 is not set
# CONFIG_FUEL_GAUGE_MM8013 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM1177 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_AHT10 is not set
# CONFIG_SENSORS_AQUACOMPUTER_D5NEXT is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_CORSAIR_PSU is not set
# CONFIG_SENSORS_DRIVETEMP is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_HS3001 is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWERZ is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2947_I2C is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC2992 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX127 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX31730 is not set
# CONFIG_SENSORS_MAX31760 is not set
# CONFIG_MAX31827 is not set
# CONFIG_SENSORS_MAX6620 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MC34VR500 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_TPS23861 is not set
# CONFIG_SENSORS_MR75203 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT6775_I2C is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_NZXT_KRAKEN2 is not set
# CONFIG_SENSORS_NZXT_SMART2 is not set
# CONFIG_SENSORS_OCC_P8_I2C is not set
# CONFIG_SENSORS_OXP is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SBTSI is not set
# CONFIG_SENSORS_SBRMI is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHT4x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC2305 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA238 is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TMP464 is not set
# CONFIG_SENSORS_TMP513 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83773G is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ASUS_EC is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_NETLINK is not set
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
# CONFIG_THERMAL_WRITABLE_TRIPS is not set
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
CONFIG_THERMAL_GOV_FAIR_SHARE=y
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_DEVFREQ_THERMAL is not set
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_X86_PKG_TEMP_THERMAL is not set
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

# CONFIG_INTEL_PCH_THERMAL is not set
# CONFIG_INTEL_TCC_COOLING is not set
# CONFIG_INTEL_HFI_THERMAL is not set
# end of Intel thermal drivers

CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ADVANTECH_EC_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_EBC_C384_WDT is not set
# CONFIG_EXAR_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_TQMX86_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set
# CONFIG_MEN_A21_WDT is not set
# CONFIG_XEN_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_AS3711 is not set
# CONFIG_MFD_SMPRO is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_AAT2870_CORE is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CS42L43_I2C is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_MFD_MP2629 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_INTEL_PMC_BXT is not set
# CONFIG_MFD_IQS62X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77541 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6360 is not set
# CONFIG_MFD_MT6370 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_SY7636A is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT4831 is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RT5120 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65910 is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS6594_I2C is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_ATC260X_I2C is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
CONFIG_RC_CORE=y
# CONFIG_LIRC is not set
CONFIG_RC_MAP=y
CONFIG_RC_DECODERS=y
# CONFIG_IR_IMON_DECODER is not set
CONFIG_IR_JVC_DECODER=y
CONFIG_IR_MCE_KBD_DECODER=y
CONFIG_IR_NEC_DECODER=y
CONFIG_IR_RC5_DECODER=y
CONFIG_IR_RC6_DECODER=y
# CONFIG_IR_RCMM_DECODER is not set
CONFIG_IR_SANYO_DECODER=y
CONFIG_IR_SHARP_DECODER=y
CONFIG_IR_SONY_DECODER=y
CONFIG_IR_XMP_DECODER=y
# CONFIG_RC_DEVICES is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_APERTURE_HELPERS=y
CONFIG_VIDEO_CMDLINE=y
CONFIG_AUXDISPLAY=y
# CONFIG_HD44780 is not set
# CONFIG_KS0108 is not set
# CONFIG_IMG_ASCII_LCD is not set
# CONFIG_HT16K33 is not set
# CONFIG_LCD2S is not set
# CONFIG_PARPORT_PANEL is not set
# CONFIG_CHARLCD_BL_OFF is not set
# CONFIG_CHARLCD_BL_ON is not set
CONFIG_CHARLCD_BL_FLASH=y
# CONFIG_PANEL is not set
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_INTEL_GTT=y
CONFIG_VGA_SWITCHEROO=y
# CONFIG_DRM is not set
CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB=y
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_XEN_FBDEV_FRONTEND=y
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SSD1307 is not set
# CONFIG_FB_SM712 is not set
CONFIG_FB_CORE=y
CONFIG_FB_NOTIFY=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DEVICE=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_IOMEM_HELPERS=y
CONFIG_FB_SYSMEM_HELPERS=y
CONFIG_FB_SYSMEM_HELPERS_DEFERRED=y
# CONFIG_FB_MODE_HELPERS is not set
CONFIG_FB_TILEBLITTING=y
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_KTD253 is not set
# CONFIG_BACKLIGHT_KTZ8866 is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_QCOM_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_GPIO is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_BACKLIGHT_ARCXCNN is not set
# end of Backlight & LCD device support

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION is not set
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
# end of Console display driver support

CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# end of Graphics support

CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_SEQ_DEVICE=m
CONFIG_SND_JACK=y
CONFIG_SND_JACK_INPUT_DEV=y
CONFIG_SND_OSSEMUL=y
# CONFIG_SND_MIXER_OSS is not set
# CONFIG_SND_PCM_OSS is not set
CONFIG_SND_PCM_TIMER=y
# CONFIG_SND_HRTIMER is not set
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_MAX_CARDS=32
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_PROC_FS=y
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_VERBOSE_PRINTK=y
CONFIG_SND_CTL_FAST_LOOKUP=y
CONFIG_SND_DEBUG=y
# CONFIG_SND_DEBUG_VERBOSE is not set
CONFIG_SND_PCM_XRUN_DEBUG=y
# CONFIG_SND_CTL_INPUT_VALIDATION is not set
# CONFIG_SND_CTL_DEBUG is not set
# CONFIG_SND_JACK_INJECTION_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_SEQUENCER=m
# CONFIG_SND_SEQ_DUMMY is not set
CONFIG_SND_SEQUENCER_OSS=m
CONFIG_SND_SEQ_MIDI_EVENT=m
# CONFIG_SND_SEQ_UMP is not set
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_ALOOP is not set
# CONFIG_SND_PCMTEST is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_PORTMAN2X4 is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ASIHPI is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LOLA is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SE6X is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set

#
# HD-Audio
#
CONFIG_SND_HDA=m
CONFIG_SND_HDA_INTEL=m
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=0
CONFIG_SND_HDA_PATCH_LOADER=y
# CONFIG_SND_HDA_CODEC_REALTEK is not set
# CONFIG_SND_HDA_CODEC_ANALOG is not set
# CONFIG_SND_HDA_CODEC_SIGMATEL is not set
# CONFIG_SND_HDA_CODEC_VIA is not set
# CONFIG_SND_HDA_CODEC_HDMI is not set
# CONFIG_SND_HDA_CODEC_CIRRUS is not set
# CONFIG_SND_HDA_CODEC_CS8409 is not set
# CONFIG_SND_HDA_CODEC_CONEXANT is not set
# CONFIG_SND_HDA_CODEC_CA0110 is not set
# CONFIG_SND_HDA_CODEC_CA0132 is not set
# CONFIG_SND_HDA_CODEC_CMEDIA is not set
# CONFIG_SND_HDA_CODEC_SI3054 is not set
CONFIG_SND_HDA_GENERIC=m
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDA_INTEL_HDMI_SILENT_STREAM is not set
# CONFIG_SND_HDA_CTL_DEV_ID is not set
# end of HD-Audio

CONFIG_SND_HDA_CORE=m
CONFIG_SND_HDA_PREALLOC_SIZE=0
CONFIG_SND_INTEL_NHLT=y
CONFIG_SND_INTEL_DSP_CONFIG=m
CONFIG_SND_INTEL_SOUNDWIRE_ACPI=m
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_UA101 is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_USB_6FIRE is not set
# CONFIG_SND_USB_HIFACE is not set
# CONFIG_SND_BCD2000 is not set
# CONFIG_SND_USB_POD is not set
# CONFIG_SND_USB_PODHD is not set
# CONFIG_SND_USB_TONEPORT is not set
# CONFIG_SND_USB_VARIAX is not set
# CONFIG_SND_PCMCIA is not set
# CONFIG_SND_SOC is not set
CONFIG_SND_X86=y
# CONFIG_SND_XEN_FRONTEND is not set
# CONFIG_SND_VIRTIO is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HID_BATTERY_STRENGTH=y
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_ASUS is not set
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
# CONFIG_HID_BETOP_FF is not set
# CONFIG_HID_BIGBEN_FF is not set
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_PRODIKEYS is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CP2112 is not set
# CONFIG_HID_CREATIVE_SB0540 is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELAN is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
# CONFIG_HID_EVISION is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_FT260 is not set
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GOOGLE_STADIA_FF is not set
# CONFIG_HID_VIVALDI is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_VRC2 is not set
# CONFIG_HID_XIAOMI is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
CONFIG_HID_ITE=y
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
# CONFIG_HID_LETSKETCH is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_HID_LOGITECH_HIDPP is not set
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
CONFIG_LOGIG940_FF=y
CONFIG_LOGIWHEELS_FF=y
CONFIG_HID_MAGICMOUSE=y
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
# CONFIG_HID_MEGAWORLD_FF is not set
CONFIG_HID_REDRAGON=y
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NINTENDO is not set
# CONFIG_HID_NTI is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
CONFIG_HID_PLANTRONICS=y
# CONFIG_HID_PXRC is not set
# CONFIG_HID_RAZER is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SEMITEK is not set
# CONFIG_HID_SIGMAMICRO is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_TOPRE is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_U2FZERO is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# CONFIG_HID_MCP2221 is not set
# end of Special HID drivers

#
# HID-BPF support
#
# end of HID-BPF support

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
# end of USB HID support

CONFIG_I2C_HID=y
# CONFIG_I2C_HID_ACPI is not set
# CONFIG_I2C_HID_OF is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
# end of Intel ISH HID support

#
# AMD SFH HID Support
#
# CONFIG_AMD_SFH_HID is not set
# end of AMD SFH HID Support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_LED_TRIG=y
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_USB_CONN_GPIO is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_FEW_INIT_RETRIES is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_PRODUCTLIST is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2
CONFIG_USB_MON=y

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
# CONFIG_USB_XHCI_DBGCAP is not set
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PCI_RENESAS is not set
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_FSL is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set
# CONFIG_USB_XEN_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set

#
# USB dual-mode controller drivers
#
# CONFIG_USB_CDNS_SUPPORT is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_XR is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_USS720 is not set
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_APPLE_MFI_FASTCHARGE is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_USB_ISP1301 is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_CLASS_MULTICOLOR is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_APU is not set
# CONFIG_LEDS_AW200XX is not set
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3532 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP3952 is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_PCA995X is not set
# CONFIG_LEDS_BD2606MVV is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_LT3593 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set
# CONFIG_LEDS_IS31FL319X is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_MLXREG is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set

#
# Flash and Torch LED drivers
#

#
# RGB LED drivers
#

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_ACTIVITY is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
# CONFIG_LEDS_TRIGGER_NETDEV is not set
# CONFIG_LEDS_TRIGGER_PATTERN is not set
# CONFIG_LEDS_TRIGGER_AUDIO is not set
# CONFIG_LEDS_TRIGGER_TTY is not set

#
# Simple LED drivers
#
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y

#
# Speakup console speech
#
# CONFIG_SPEAKUP is not set
# end of Speakup console speech

# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_DECODE_MCE is not set
# CONFIG_EDAC_GHES is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
# CONFIG_EDAC_SKX is not set
# CONFIG_EDAC_I10NM is not set
# CONFIG_EDAC_PND2 is not set
# CONFIG_EDAC_IGEN6 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_SYSTOHC is not set
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABEOZ9 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3028 is not set
# CONFIG_RTC_DRV_RV3032 is not set
# CONFIG_RTC_DRV_RV8803 is not set
# CONFIG_RTC_DRV_SD3078 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set
# CONFIG_RTC_DRV_RX6110 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_RP5C01 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_GOLDFISH is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IDXD is not set
# CONFIG_INTEL_IDXD_COMPAT is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_PLX_DMA is not set
# CONFIG_XILINX_DMA is not set
# CONFIG_XILINX_XDMA is not set
# CONFIG_AMD_PTDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
CONFIG_DW_DMAC_PCI=y
# CONFIG_DW_EDMA is not set
CONFIG_HSU_DMA=y
# CONFIG_SF_PDMA is not set
# CONFIG_INTEL_LDMA is not set

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
# CONFIG_SW_SYNC is not set
CONFIG_UDMABUF=y
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_DEBUG is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# CONFIG_DMABUF_SYSFS_STATS is not set
# end of DMABUF options

# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_VIRT_DRIVERS=y
CONFIG_VMGENID=y
# CONFIG_VBOXGUEST is not set
# CONFIG_NITRO_ENCLAVES is not set
# CONFIG_EFI_SECRET is not set
CONFIG_VIRTIO_ANCHOR=y
CONFIG_VIRTIO=m
CONFIG_VIRTIO_PCI_LIB=m
CONFIG_VIRTIO_PCI_LIB_LEGACY=m
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_PCI_LEGACY=y
CONFIG_VIRTIO_BALLOON=m
CONFIG_VIRTIO_MEM=m
# CONFIG_VIRTIO_INPUT is not set
CONFIG_VIRTIO_MMIO=m
CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y
# CONFIG_VDPA is not set
CONFIG_VHOST_MENU=y
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set
# end of Microsoft Hyper-V guest support

#
# Xen driver support
#
CONFIG_XEN_BALLOON=y
# CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is not set
CONFIG_XEN_MEMORY_HOTPLUG_LIMIT=512
CONFIG_XEN_SCRUB_PAGES_DEFAULT=y
# CONFIG_XEN_DEV_EVTCHN is not set
CONFIG_XEN_BACKEND=y
# CONFIG_XENFS is not set
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
# CONFIG_XEN_GNTDEV is not set
# CONFIG_XEN_GRANT_DEV_ALLOC is not set
# CONFIG_XEN_GRANT_DMA_ALLOC is not set
CONFIG_SWIOTLB_XEN=y
# CONFIG_XEN_PCIDEV_BACKEND is not set
# CONFIG_XEN_PVCALLS_FRONTEND is not set
# CONFIG_XEN_PVCALLS_BACKEND is not set
CONFIG_XEN_PRIVCMD=m
# CONFIG_XEN_ACPI_PROCESSOR is not set
# CONFIG_XEN_MCE_LOG is not set
CONFIG_XEN_HAVE_PVMMU=y
CONFIG_XEN_EFI=y
CONFIG_XEN_AUTO_XLATE=y
CONFIG_XEN_ACPI=y
CONFIG_XEN_HAVE_VPMU=y
# CONFIG_XEN_VIRTIO is not set
# end of Xen driver support

# CONFIG_GREYBUS is not set
# CONFIG_COMEDI is not set
CONFIG_STAGING=y
# CONFIG_RTL8192U is not set
# CONFIG_RTLLIB is not set
# CONFIG_RTS5208 is not set
# CONFIG_FB_SM750 is not set
CONFIG_STAGING_MEDIA=y
# CONFIG_LTE_GDM724X is not set
# CONFIG_FIELDBUS_DEV is not set
# CONFIG_QLGE is not set
# CONFIG_VME_BUS is not set
CONFIG_CHROME_PLATFORMS=y
# CONFIG_CHROMEOS_ACPI is not set
# CONFIG_CHROMEOS_LAPTOP is not set
# CONFIG_CHROMEOS_PSTORE is not set
# CONFIG_CHROMEOS_TBMC is not set
# CONFIG_CROS_EC is not set
# CONFIG_CROS_KBD_LED_BACKLIGHT is not set
# CONFIG_CROS_HPS_I2C is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_SURFACE_PLATFORMS=y
# CONFIG_SURFACE_3_POWER_OPREGION is not set
# CONFIG_SURFACE_GPE is not set
# CONFIG_SURFACE_HOTPLUG is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACPI_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ACER_WIRELESS is not set
# CONFIG_AMD_PMF is not set
# CONFIG_AMD_PMC is not set
# CONFIG_AMD_HSMP is not set
# CONFIG_ADV_SWBUTTON is not set
# CONFIG_APPLE_GMUX is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ASUS_TF103C_DOCK is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_X86_PLATFORM_DRIVERS_DELL is not set
# CONFIG_AMILO_RFKILL is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_GPD_POCKET_FAN is not set
# CONFIG_X86_PLATFORM_DRIVERS_HP is not set
# CONFIG_WIRELESS_HOTKEY is not set
# CONFIG_IBM_RTL is not set
# CONFIG_IDEAPAD_LAPTOP is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_ATOMISP2_PM is not set
# CONFIG_INTEL_IFS is not set
# CONFIG_INTEL_SAR_INT1092 is not set
# CONFIG_INTEL_PMC_CORE is not set

#
# Intel Speed Select Technology interface support
#
# CONFIG_INTEL_SPEED_SELECT_INTERFACE is not set
# end of Intel Speed Select Technology interface support

#
# Intel Uncore Frequency Control
#
# CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set
# end of Intel Uncore Frequency Control

# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_INTEL_INT0002_VGPIO is not set
# CONFIG_INTEL_OAKTRAIL is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_INTEL_RST is not set
CONFIG_INTEL_SMARTCONNECT=y
# CONFIG_INTEL_TURBO_MAX_3 is not set
# CONFIG_INTEL_VSEC is not set
# CONFIG_MSI_EC is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PCENGINES_APU2 is not set
# CONFIG_BARCO_P50_GPIO is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_SYSTEM76_ACPI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_MLX_PLATFORM is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_SCU_PCI is not set
# CONFIG_INTEL_SCU_PLATFORM is not set
# CONFIG_SIEMENS_SIMATIC_IPC is not set
# CONFIG_WINMATE_FM07_KEYS is not set
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y
# CONFIG_COMMON_CLK_MAX9485 is not set
# CONFIG_COMMON_CLK_SI5341 is not set
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_SI544 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_XILINX_VCU is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IO_PGTABLE=y
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_AMD_IOMMU=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_PERF_EVENTS=y
# CONFIG_IOMMUFD is not set
CONFIG_IRQ_REMAP=y
# CONFIG_VIRTIO_IOMMU is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# fujitsu SoC drivers
#
# end of fujitsu SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

# CONFIG_WPCM450_SOC is not set

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
# CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND is not set
# CONFIG_DEVFREQ_GOV_PERFORMANCE is not set
# CONFIG_DEVFREQ_GOV_POWERSAVE is not set
# CONFIG_DEVFREQ_GOV_USERSPACE is not set
# CONFIG_DEVFREQ_GOV_PASSIVE is not set

#
# DEVFREQ Drivers
#
# CONFIG_PM_DEVFREQ_EVENT is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_TI_SYSCON is not set
# CONFIG_RESET_TI_TPS380X is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_USB_LGM_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_RTK_RTD_USB2PHY is not set
# CONFIG_PHY_RTK_RTD_USB3PHY is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

CONFIG_POWERCAP=y
# CONFIG_INTEL_RAPL is not set
# CONFIG_IDLE_INJECT is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID_BINDER_IPC is not set
# end of Android

# CONFIG_LIBNVDIMM is not set
CONFIG_DAX=y
# CONFIG_DEV_DAX is not set
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y

#
# Layout Types
#
# CONFIG_NVMEM_LAYOUT_SL28_VPD is not set
# CONFIG_NVMEM_LAYOUT_ONIE_TLV is not set
# end of Layout Types

# CONFIG_NVMEM_RMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
CONFIG_PM_OPP=y
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# CONFIG_PECI is not set
# CONFIG_HTE is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_VALIDATE_FS_PARSER=y
CONFIG_FS_IOMAP=y
CONFIG_BUFFER_HEAD=y
CONFIG_LEGACY_DIRECT_IO=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_XFS_FS=m
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_SUPPORT_ASCII_CI=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set
# CONFIG_XFS_ONLINE_SCRUB is not set
# CONFIG_XFS_WARN is not set
# CONFIG_XFS_DEBUG is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
# CONFIG_BTRFS_FS_REF_VERIFY is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_BCACHEFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
# CONFIG_FS_ENCRYPTION is not set
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=y
CONFIG_FUSE_FS=m
# CONFIG_CUSE is not set
# CONFIG_VIRTIO_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
CONFIG_NETFS_SUPPORT=m
# CONFIG_NETFS_STATS is not set
# CONFIG_FSCACHE is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_NTFS3_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
# CONFIG_PROC_VMCORE_DEVICE_DUMP is not set
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
# CONFIG_TMPFS_INODE64 is not set
# CONFIG_TMPFS_QUOTA is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=y
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
CONFIG_HFS_FS=m
CONFIG_HFSPLUS_FS=m
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
CONFIG_MINIX_FS=m
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_COMPRESS=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_PSTORE_BLK is not set
# CONFIG_SYSV_FS is not set
CONFIG_UFS_FS=m
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
# CONFIG_EROFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
CONFIG_NFSD=m
# CONFIG_NFSD_V2 is not set
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
# CONFIG_NFSD_BLOCKLAYOUT is not set
# CONFIG_NFSD_SCSILAYOUT is not set
# CONFIG_NFSD_FLEXFILELAYOUT is not set
CONFIG_NFSD_V4_SECURITY_LABEL=y
CONFIG_GRACE_PERIOD=m
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_SUNRPC_DEBUG=y
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_SMB_SERVER is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=m
CONFIG_9P_FS_POSIX_ACL=y
CONFIG_9P_FS_SECURITY=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=m
CONFIG_NLS_UCS2_UTILS=m
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
CONFIG_PERSISTENT_KEYRINGS=y
# CONFIG_TRUSTED_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
# CONFIG_SECURITY_PATH is not set
CONFIG_INTEL_TXT=y
CONFIG_LSM_MMAP_MIN_ADDR=65536
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS=9
CONFIG_SECURITY_SELINUX_SID2STR_CACHE_SIZE=256
# CONFIG_SECURITY_SELINUX_DEBUG is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_SECURITY_SAFESETID is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
# CONFIG_SECURITY_LANDLOCK is not set
# CONFIG_INTEGRITY is not set
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_LSM="yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_BARE=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_STACK_ALL_PATTERN is not set
# CONFIG_INIT_STACK_ALL_ZERO is not set
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
# CONFIG_ZERO_CALL_USED_REGS is not set
# end of Memory initialization

#
# Hardening of kernel data structures
#
CONFIG_LIST_HARDENED=y
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# end of Hardening of kernel data structures

CONFIG_RANDSTRUCT_NONE=y
# end of Kernel hardening options
# end of Security options

CONFIG_XOR_BLOCKS=m
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SIG2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_KPP=m
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
# CONFIG_CRYPTO_MANAGER_EXTRA_TESTS is not set
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_SIMD=y
CONFIG_CRYPTO_ENGINE=m
# end of Crypto core or helper

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
# CONFIG_CRYPTO_DH is not set
CONFIG_CRYPTO_ECC=m
CONFIG_CRYPTO_ECDH=m
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# end of Public-key cryptography

#
# Block ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARIA is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SM4_GENERIC is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# end of Block ciphers

#
# Length-preserving ciphers and modes
#
# CONFIG_CRYPTO_ADIANTUM is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_CHACHA20 is not set
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_HCTR2 is not set
# CONFIG_CRYPTO_KEYWRAP is not set
CONFIG_CRYPTO_LRW=y
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# end of Length-preserving ciphers and modes

#
# AEAD (authenticated encryption with associated data) ciphers
#
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_CCM is not set
CONFIG_CRYPTO_GCM=y
CONFIG_CRYPTO_GENIV=y
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=m
# CONFIG_CRYPTO_ESSIV is not set
# end of AEAD (authenticated encryption with associated data) ciphers

#
# Hashes, digests, and MACs
#
CONFIG_CRYPTO_BLAKE2B=m
CONFIG_CRYPTO_CMAC=m
CONFIG_CRYPTO_GHASH=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_SHA3=y
# CONFIG_CRYPTO_SM3_GENERIC is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_VMAC is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_XCBC is not set
CONFIG_CRYPTO_XXHASH=m
# end of Hashes, digests, and MACs

#
# CRCs (cyclic redundancy checks)
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32 is not set
CONFIG_CRYPTO_CRCT10DIF=y
CONFIG_CRYPTO_CRC64_ROCKSOFT=y
# end of CRCs (cyclic redundancy checks)

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_LZO=y
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set
# end of Compression

#
# Random number generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
# CONFIG_CRYPTO_JITTERENTROPY_TESTINTERFACE is not set
# end of Random number generation

#
# Userspace interface
#
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_USER_API_ENABLE_OBSOLETE=y
# end of Userspace interface

CONFIG_CRYPTO_HASH_INFO=y

#
# Accelerated Cryptographic Algorithms for CPU (x86)
#
# CONFIG_CRYPTO_CURVE25519_X86 is not set
CONFIG_CRYPTO_AES_NI_INTEL=y
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_ARIA_GFNI_AVX512_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
# CONFIG_CRYPTO_POLYVAL_CLMUL_NI is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SM3_AVX_X86_64 is not set
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=m
CONFIG_CRYPTO_CRC32C_INTEL=m
CONFIG_CRYPTO_CRC32_PCLMUL=m
CONFIG_CRYPTO_CRCT10DIF_PCLMUL=m
# end of Accelerated Cryptographic Algorithms for CPU (x86)

CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_ATMEL_ECC is not set
# CONFIG_CRYPTO_DEV_ATMEL_SHA204A is not set
CONFIG_CRYPTO_DEV_CCP=y
# CONFIG_CRYPTO_DEV_CCP_DD is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_4XXX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
CONFIG_CRYPTO_DEV_VIRTIO=m
# CONFIG_CRYPTO_DEV_SAFEXCEL is not set
# CONFIG_CRYPTO_DEV_AMLOGIC_GXL is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_X509_CERTIFICATE_PARSER=y
# CONFIG_PKCS8_PRIVATE_KEY_PARSER is not set
CONFIG_PKCS7_MESSAGE_PARSER=y
# CONFIG_FIPS_SIGNATURE_SELFTEST is not set

#
# Certificates for signature checking
#
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_SYSTEM_TRUSTED_KEYS=""
# CONFIG_SYSTEM_EXTRA_CERTIFICATE is not set
# CONFIG_SECONDARY_TRUSTED_KEYRING is not set
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_RAID6_PQ_BENCHMARK=y
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_UTILS=y
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_GF128MUL=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC64_ROCKSOFT=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
CONFIG_CRC64=y
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=m
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_ZSTD_COMMON=y
CONFIG_ZSTD_COMPRESS=m
CONFIG_ZSTD_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
# CONFIG_XZ_DEC_MICROLZMA is not set
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_FLAGS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_SWIOTLB=y
# CONFIG_SWIOTLB_DYNAMIC is not set
# CONFIG_DMA_CMA is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_DMA_MAP_BENCHMARK is not set
CONFIG_SGL_ALLOC=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
# CONFIG_IRQ_POLL is not set
CONFIG_MPILIB=y
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_STACKDEPOT=y
CONFIG_STACKDEPOT_ALWAYS_INIT=y
CONFIG_SBITMAP=y
# CONFIG_LWQ_TEST is not set
# end of Library routines

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
# CONFIG_PRINTK_CALLER is not set
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
CONFIG_BOOT_PRINTK_DELAY=y
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_AS_HAS_NON_CONST_LEB128=y
# CONFIG_DEBUG_INFO_NONE is not set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
# CONFIG_DEBUG_INFO_REDUCED is not set
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
# CONFIG_DEBUG_INFO_COMPRESSED_ZLIB is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_LANG_EXCLUDE=y
# CONFIG_GDB_SCRIPTS is not set
CONFIG_FRAME_WARN=2048
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_READABLE_ASM is not set
# CONFIG_HEADERS_INSTALL is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_OBJTOOL=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x0
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_HONOUR_BLOCKLIST=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_DEFAULT_ENABLE=0x1
CONFIG_KDB_KEYBOARD=y
CONFIG_KDB_CONTINUE_CATASTROPHIC=0
CONFIG_ARCH_HAS_EARLY_DEBUG=y
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# CONFIG_KCSAN is not set
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
# CONFIG_NET_DEV_REFCNT_TRACKER is not set
# CONFIG_NET_NS_REFCNT_TRACKER is not set
# CONFIG_DEBUG_NET is not set
# end of Networking Debugging

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_TABLE_CHECK is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
CONFIG_PER_VMA_LOCK_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SHRINKER_DEBUG is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_MAPLE_TREE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_DEBUG_VM_PGTABLE=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
CONFIG_HAVE_ARCH_KFENCE=y
# CONFIG_KFENCE is not set
CONFIG_HAVE_ARCH_KMSAN=y
# end of Memory Debugging

CONFIG_DEBUG_SHIRQ=y

#
# Debug Oops, Lockups and Hangs
#
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_HARDLOCKUP_DETECTOR_PREFER_BUDDY is not set
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
# CONFIG_HARDLOCKUP_DETECTOR_BUDDY is not set
# CONFIG_HARDLOCKUP_DETECTOR_ARCH is not set
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_WQ_CPU_INTENSIVE_REPORT is not set
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
# CONFIG_SCHEDSTATS is not set
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set
CONFIG_DEBUG_PREEMPT=y

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_NMI_CHECK_CPU is not set
# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_MAPLE_TREE is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_DEBUG_CGROUP_REF is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_RETHOOK=y
CONFIG_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_RETVAL=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_OBJTOOL_NOP_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_BOOTTIME_TRACING is not set
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_FUNCTION_GRAPH_RETVAL is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FPROBE is not set
CONFIG_FUNCTION_PROFILER=y
CONFIG_STACK_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set
CONFIG_SCHED_TRACER=y
# CONFIG_HWLAT_TRACER is not set
# CONFIG_OSNOISE_TRACER is not set
# CONFIG_TIMERLAT_TRACER is not set
# CONFIG_MMIOTRACE is not set
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
# CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
# CONFIG_UPROBE_EVENTS is not set
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
# CONFIG_SYNTH_EVENTS is not set
# CONFIG_USER_EVENTS is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_FTRACE_SORT_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_RV is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_IO_STRICT_DEVMEM is not set

#
# x86 Debugging
#
CONFIG_EARLY_PRINTK_USB=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_X86_DECODER_SELFTEST=y
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
# CONFIG_FAULT_INJECTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_TEST_DHRY is not set
# CONFIG_LKDTM is not set
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_DIV64 is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_TEST_REF_TRACKER is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
CONFIG_ATOMIC64_SELFTEST=y
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_STRING_SELFTEST is not set
# CONFIG_TEST_STRING_HELPERS is not set
CONFIG_TEST_KSTRTOX=y
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_SCANF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
CONFIG_TEST_XARRAY=m
# CONFIG_TEST_MAPLE_TREE is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_IDA is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_DYNAMIC_DEBUG is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FREE_PAGES is not set
# CONFIG_TEST_FPU is not set
# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
CONFIG_ARCH_USE_MEMTEST=y
# CONFIG_MEMTEST is not set
# end of Kernel Testing and Coverage

#
# Rust hacking
#
# end of Rust hacking
# end of Kernel hacking

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-19 18:47               ` Mike Kravetz
@ 2023-09-19 20:57                 ` Zi Yan
  2023-09-20  0:32                   ` Mike Kravetz
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-19 20:57 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Johannes Weiner, Vlastimil Babka, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 11717 bytes --]

On 19 Sep 2023, at 14:47, Mike Kravetz wrote:

> On 09/19/23 02:49, Johannes Weiner wrote:
>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>>
>>>>>> With the patch below applied, a slightly different workload triggers the
>>>>>> following warnings.  It seems related, and appears to go away when
>>>>>> reverting the series.
>>>>>>
>>>>>> [  331.595382] ------------[ cut here ]------------
>>>>>> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>>> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>>
>>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>>> HIGHATOMIC blocks?
>>
>> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
>> show any CMA pages.
>>
>> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
>> and HIGHATOMIC.
>>
>>>> This means we have an order-10 page where one half is MOVABLE and the
>>>> other is CMA.
>>
>> This means the scenario is different:
>>
>> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
>> that the first pageblock is indeed MOVABLE. During the expand, the
>> second pageblock turns out to be of type MIGRATE_ISOLATE.
>>
>> The page allocator wouldn't have merged those types. It triggers a bit
>> too fast to be a race condition.
>>
>> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
>> while the head is on the list, and then stranded there.
>>
>> Could this be an issue in the page_isolation code? Maybe a range
>> rounding error?
>>
>> Zi Yan, does this ring a bell for you?
>>
>> I don't quite see how my patches could have caused this. But AFAICS we
>> also didn't have warnings for this scenario so it could be an old bug.
>>
>>>> Mike, could you describe the workload that is triggering this?
>>>
>>> This 'slightly different workload' is actually a slightly different
>>> environment.  Sorry for mis-speaking!  The slight difference is that this
>>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>>> (hugetlb_cma) feature that triggered the previous issue.
>>>
>>> This is still on a 16G VM.  Kernel command line here is:
>>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>>> hugetlb_free_vmemmap=on"
>>>
>>> The workload is just running this script:
>>> while true; do
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>>>
>>>> Does this reproduce instantly and reliably?
>>>>
>>>
>>> It is not 'instant' but will reproduce fairly reliably within a minute
>>> or so.
>>>
>>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>>> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
>>> will eventually be freed via __free_pages(folio, 9).
>>
>> No luck reproducing this yet, but I have a question. In that crash
>> stack trace, the expand() is called via this:
>>
>>  [  331.645847]  get_page_from_freelist+0x3ed/0x1040
>>  [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
>>  [  331.647977]  __alloc_pages+0xec/0x240
>>  [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
>>  [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
>>  [  331.650938]  alloc_pool_huge_folio+0xad/0x110
>>  [  331.651909]  set_max_huge_pages+0x17d/0x390
>>
>> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
>> alloc_fresh_hugetlb_folio(), which has this:
>>
>>         if (hstate_is_gigantic(h))
>>                 folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
>>         else
>>                 folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
>>                                 nid, nmask, node_alloc_noretry);
>>
>> where gigantic is defined as the order exceeding MAX_ORDER, which
>> should be the case for 1G pages on x86.
>>
>> So the crashing stack must be from a 2M allocation, no? I'm confused
>> how that could happen with the above test case.
>
> Sorry for causing the confusion!
>
> When I originally saw the warnings pop up, I was running the above script
> as well as another that only allocated order 9 hugetlb pages:
>
> while true; do
> 	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
>
> The warnings were actually triggered by allocations in this second script.
>
> However, when reporting the warnings I wanted to include the simplest
> way to recreate.  And, I noticed that that second script running in
> parallel was not required.  Again, sorry for the confusion!  Here is a
> warning triggered via the alloc_contig_range path only running the one
> script.
>
> [  107.275821] ------------[ cut here ]------------
> [  107.277001] page type is 0, passed migratetype is 1 (nr=512)
> [  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
> [  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> [  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
> [  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
> [  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
> [  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
> [  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> [  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
> [  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> [  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
> [  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
> [  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
> [  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
> [  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  107.321575] Call Trace:
> [  107.322314]  <TASK>
> [  107.323002]  ? del_page_from_free_list+0x137/0x170
> [  107.324380]  ? __warn+0x7d/0x130
> [  107.325341]  ? del_page_from_free_list+0x137/0x170
> [  107.326627]  ? report_bug+0x18d/0x1c0
> [  107.327632]  ? prb_read_valid+0x17/0x20
> [  107.328711]  ? handle_bug+0x41/0x70
> [  107.329685]  ? exc_invalid_op+0x13/0x60
> [  107.330787]  ? asm_exc_invalid_op+0x16/0x20
> [  107.331937]  ? del_page_from_free_list+0x137/0x170
> [  107.333189]  __free_one_page+0x2ab/0x6f0
> [  107.334375]  free_pcppages_bulk+0x169/0x210
> [  107.335575]  drain_pages_zone+0x3f/0x50
> [  107.336691]  __drain_all_pages+0xe2/0x1e0
> [  107.337843]  alloc_contig_range+0x143/0x280
> [  107.339026]  alloc_contig_pages+0x210/0x270
> [  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
> [  107.341529]  alloc_pool_huge_page+0x7d/0x100
> [  107.342745]  set_max_huge_pages+0x162/0x340
> [  107.345059]  nr_hugepages_store_common+0x91/0xf0
> [  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
> [  107.347547]  vfs_write+0x207/0x400
> [  107.348543]  ksys_write+0x63/0xe0
> [  107.349511]  do_syscall_64+0x37/0x90
> [  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  107.351940] RIP: 0033:0x7fabb8daee87
> [  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
> [  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
> [  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
> [  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
> [  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
> [  107.365968]  </TASK>
> [  107.366534] ---[ end trace 0000000000000000 ]---
> [  121.542474] ------------[ cut here ]------------
>
> Perhaps that is another piece of information in that the warning can be
> triggered via both allocation paths.
>
> To be perfectly clear, here is what I did today:
> - built next-20230919.  It does not contain your series
>   	I could not recreate the issue.
> - Added your series and the patch to remove
>   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
> 	I could recreate the issue while running only the one script.
> 	The warning above is from that run.
> - Added this suggested patch from Zi
> 	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> 	index 1400e674ab86..77a4aea31a7f 100644
> 	--- a/mm/page_alloc.c
> 	+++ b/mm/page_alloc.c
> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>  		end = pageblock_end_pfn(pfn) - 1;
>
>  		/* Do not cross zone boundaries */
> 	+#if 0
>  		if (!zone_spans_pfn(zone, start))
> 			start = zone->zone_start_pfn;
> 	+#else
> 	+	if (!zone_spans_pfn(zone, start))
> 	+		start = pfn;
> 	+#endif
> 	 	if (!zone_spans_pfn(zone, end))
> 	 		return false;
> 	I can still trigger warnings.

OK. One thing to note is that the page type in the warning changed from
5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.

>
> One idea about recreating the issue is that it may have to do with size
> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> to really stress the allocations by increasing the number of hugetlb
> pages requested and that did not help.  I also noticed that I only seem
> to get two warnings and then they stop, even if I continue to run the
> script.
>
> Zi asked about my config, so it is attached.

With your config, I still have no luck reproducing the issue. I will keep
trying. Thanks.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-19 20:57                 ` Zi Yan
@ 2023-09-20  0:32                   ` Mike Kravetz
  2023-09-20  1:38                     ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Mike Kravetz @ 2023-09-20  0:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: Johannes Weiner, Vlastimil Babka, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On 09/19/23 16:57, Zi Yan wrote:
> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> 
> > On 09/19/23 02:49, Johannes Weiner wrote:
> >> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> >>> On 09/18/23 10:52, Johannes Weiner wrote:
> >>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> >>>>> On 9/16/23 21:57, Mike Kravetz wrote:
> >>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
> >>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >
> > Sorry for causing the confusion!
> >
> > When I originally saw the warnings pop up, I was running the above script
> > as well as another that only allocated order 9 hugetlb pages:
> >
> > while true; do
> > 	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > 	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> >
> > The warnings were actually triggered by allocations in this second script.
> >
> > However, when reporting the warnings I wanted to include the simplest
> > way to recreate.  And, I noticed that that second script running in
> > parallel was not required.  Again, sorry for the confusion!  Here is a
> > warning triggered via the alloc_contig_range path only running the one
> > script.
> >
> > [  107.275821] ------------[ cut here ]------------
> > [  107.277001] page type is 0, passed migratetype is 1 (nr=512)
> > [  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
> > [  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> > [  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
> > [  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> > [  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
> > [  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
> > [  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
> > [  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> > [  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
> > [  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> > [  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
> > [  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
> > [  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
> > [  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
> > [  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  107.321575] Call Trace:
> > [  107.322314]  <TASK>
> > [  107.323002]  ? del_page_from_free_list+0x137/0x170
> > [  107.324380]  ? __warn+0x7d/0x130
> > [  107.325341]  ? del_page_from_free_list+0x137/0x170
> > [  107.326627]  ? report_bug+0x18d/0x1c0
> > [  107.327632]  ? prb_read_valid+0x17/0x20
> > [  107.328711]  ? handle_bug+0x41/0x70
> > [  107.329685]  ? exc_invalid_op+0x13/0x60
> > [  107.330787]  ? asm_exc_invalid_op+0x16/0x20
> > [  107.331937]  ? del_page_from_free_list+0x137/0x170
> > [  107.333189]  __free_one_page+0x2ab/0x6f0
> > [  107.334375]  free_pcppages_bulk+0x169/0x210
> > [  107.335575]  drain_pages_zone+0x3f/0x50
> > [  107.336691]  __drain_all_pages+0xe2/0x1e0
> > [  107.337843]  alloc_contig_range+0x143/0x280
> > [  107.339026]  alloc_contig_pages+0x210/0x270
> > [  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
> > [  107.341529]  alloc_pool_huge_page+0x7d/0x100
> > [  107.342745]  set_max_huge_pages+0x162/0x340
> > [  107.345059]  nr_hugepages_store_common+0x91/0xf0
> > [  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
> > [  107.347547]  vfs_write+0x207/0x400
> > [  107.348543]  ksys_write+0x63/0xe0
> > [  107.349511]  do_syscall_64+0x37/0x90
> > [  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > [  107.351940] RIP: 0033:0x7fabb8daee87
> > [  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > [  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
> > [  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
> > [  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
> > [  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
> > [  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
> > [  107.365968]  </TASK>
> > [  107.366534] ---[ end trace 0000000000000000 ]---
> > [  121.542474] ------------[ cut here ]------------
> >
> > Perhaps that is another piece of information in that the warning can be
> > triggered via both allocation paths.
> >
> > To be perfectly clear, here is what I did today:
> > - built next-20230919.  It does not contain your series
> >   	I could not recreate the issue.
> > - Added your series and the patch to remove
> >   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
> > 	I could recreate the issue while running only the one script.
> > 	The warning above is from that run.
> > - Added this suggested patch from Zi
> > 	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > 	index 1400e674ab86..77a4aea31a7f 100644
> > 	--- a/mm/page_alloc.c
> > 	+++ b/mm/page_alloc.c
> > 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >  		end = pageblock_end_pfn(pfn) - 1;
> >
> >  		/* Do not cross zone boundaries */
> > 	+#if 0
> >  		if (!zone_spans_pfn(zone, start))
> > 			start = zone->zone_start_pfn;
> > 	+#else
> > 	+	if (!zone_spans_pfn(zone, start))
> > 	+		start = pfn;
> > 	+#endif
> > 	 	if (!zone_spans_pfn(zone, end))
> > 	 		return false;
> > 	I can still trigger warnings.
> 
> OK. One thing to note is that the page type in the warning changed from
> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> 

Just to be really clear,
- the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
- the 0 (MIGRATE_UNMOVABLE) warning above was from the alloc_contig_range
  call path WITHOUT your change.

I am guessing the difference here has more to do with the allocation path?

I went back and reran focusing on the specific migrate type.
Without your patch, and coming from the alloc_contig_range call path,
I got two warnings of 'page type is 0, passed migratetype is 1' as above.
With your patch I got one 'page type is 0, passed migratetype is 1'
warning and one 'page type is 1, passed migratetype is 0' warning.

I could be wrong, but I do not think your patch changes things.

> >
> > One idea about recreating the issue is that it may have to do with size
> > of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> > to really stress the allocations by increasing the number of hugetlb
> > pages requested and that did not help.  I also noticed that I only seem
> > to get two warnings and then they stop, even if I continue to run the
> > script.
> >
> > Zi asked about my config, so it is attached.
> 
> With your config, I still have no luck reproducing the issue. I will keep
> trying. Thanks.
> 

Perhaps try running both scripts in parallel?
Adjust the number of hugetlb pages allocated to equal 25% of memory?
-- 
Mike Kravetz


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-20  0:32                   ` Mike Kravetz
@ 2023-09-20  1:38                     ` Zi Yan
  2023-09-20  6:07                       ` Vlastimil Babka
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-20  1:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Johannes Weiner, Vlastimil Babka, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 8917 bytes --]

On 19 Sep 2023, at 20:32, Mike Kravetz wrote:

> On 09/19/23 16:57, Zi Yan wrote:
>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>
>>> On 09/19/23 02:49, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>
>>> Sorry for causing the confusion!
>>>
>>> When I originally saw the warnings pop up, I was running the above script
>>> as well as another that only allocated order 9 hugetlb pages:
>>>
>>> while true; do
>>> 	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> 	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>> The warnings were actually triggered by allocations in this second script.
>>>
>>> However, when reporting the warnings I wanted to include the simplest
>>> way to recreate.  And, I noticed that that second script running in
>>> parallel was not required.  Again, sorry for the confusion!  Here is a
>>> warning triggered via the alloc_contig_range path only running the one
>>> script.
>>>
>>> [  107.275821] ------------[ cut here ]------------
>>> [  107.277001] page type is 0, passed migratetype is 1 (nr=512)
>>> [  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
>>> [  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
>>> [  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
>>> [  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
>>> [  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
>>> [  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
>>> [  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
>>> [  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
>>> [  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
>>> [  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
>>> [  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
>>> [  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
>>> [  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
>>> [  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
>>> [  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [  107.321575] Call Trace:
>>> [  107.322314]  <TASK>
>>> [  107.323002]  ? del_page_from_free_list+0x137/0x170
>>> [  107.324380]  ? __warn+0x7d/0x130
>>> [  107.325341]  ? del_page_from_free_list+0x137/0x170
>>> [  107.326627]  ? report_bug+0x18d/0x1c0
>>> [  107.327632]  ? prb_read_valid+0x17/0x20
>>> [  107.328711]  ? handle_bug+0x41/0x70
>>> [  107.329685]  ? exc_invalid_op+0x13/0x60
>>> [  107.330787]  ? asm_exc_invalid_op+0x16/0x20
>>> [  107.331937]  ? del_page_from_free_list+0x137/0x170
>>> [  107.333189]  __free_one_page+0x2ab/0x6f0
>>> [  107.334375]  free_pcppages_bulk+0x169/0x210
>>> [  107.335575]  drain_pages_zone+0x3f/0x50
>>> [  107.336691]  __drain_all_pages+0xe2/0x1e0
>>> [  107.337843]  alloc_contig_range+0x143/0x280
>>> [  107.339026]  alloc_contig_pages+0x210/0x270
>>> [  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
>>> [  107.341529]  alloc_pool_huge_page+0x7d/0x100
>>> [  107.342745]  set_max_huge_pages+0x162/0x340
>>> [  107.345059]  nr_hugepages_store_common+0x91/0xf0
>>> [  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
>>> [  107.347547]  vfs_write+0x207/0x400
>>> [  107.348543]  ksys_write+0x63/0xe0
>>> [  107.349511]  do_syscall_64+0x37/0x90
>>> [  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>> [  107.351940] RIP: 0033:0x7fabb8daee87
>>> [  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
>>> [  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
>>> [  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
>>> [  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
>>> [  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
>>> [  107.365968]  </TASK>
>>> [  107.366534] ---[ end trace 0000000000000000 ]---
>>> [  121.542474] ------------[ cut here ]------------
>>>
>>> Perhaps that is another piece of information in that the warning can be
>>> triggered via both allocation paths.
>>>
>>> To be perfectly clear, here is what I did today:
>>> - built next-20230919.  It does not contain your series
>>>   	I could not recreate the issue.
>>> - Added your series and the patch to remove
>>>   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
>>> 	I could recreate the issue while running only the one script.
>>> 	The warning above is from that run.
>>> - Added this suggested patch from Zi
>>> 	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> 	index 1400e674ab86..77a4aea31a7f 100644
>>> 	--- a/mm/page_alloc.c
>>> 	+++ b/mm/page_alloc.c
>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>
>>>  		/* Do not cross zone boundaries */
>>> 	+#if 0
>>>  		if (!zone_spans_pfn(zone, start))
>>> 			start = zone->zone_start_pfn;
>>> 	+#else
>>> 	+	if (!zone_spans_pfn(zone, start))
>>> 	+		start = pfn;
>>> 	+#endif
>>> 	 	if (!zone_spans_pfn(zone, end))
>>> 	 		return false;
>>> 	I can still trigger warnings.
>>
>> OK. One thing to note is that the page type in the warning changed from
>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>
>
> Just to be really clear,
> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>   path WITHOUT your change.
>
> I am guessing the difference here has more to do with the allocation path?
>
> I went back and reran focusing on the specific migrate type.
> Without your patch, and coming from the alloc_contig_range call path,
> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> With your patch I got one 'page type is 0, passed migratetype is 1'
> warning and one 'page type is 1, passed migratetype is 0' warning.
>
> I could be wrong, but I do not think your patch changes things.

Got it. Thanks for the clarification.
>
>>>
>>> One idea about recreating the issue is that it may have to do with size
>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>> to really stress the allocations by increasing the number of hugetlb
>>> pages requested and that did not help.  I also noticed that I only seem
>>> to get two warnings and then they stop, even if I continue to run the
>>> script.
>>>
>>> Zi asked about my config, so it is attached.
>>
>> With your config, I still have no luck reproducing the issue. I will keep
>> trying. Thanks.
>>
>
> Perhaps try running both scripts in parallel?

Yes. It seems to do the trick.

> Adjust the number of hugetlb pages allocated to equal 25% of memory?

I am able to reproduce it with the script below:

while true; do
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
 echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
 wait
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
done

I will look into the issue.

--
Best Regards,
Yan, Zi



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-20  1:38                     ` Zi Yan
@ 2023-09-20  6:07                       ` Vlastimil Babka
  2023-09-20 13:48                         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Vlastimil Babka @ 2023-09-20  6:07 UTC (permalink / raw)
  To: Zi Yan, Mike Kravetz
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On 9/20/23 03:38, Zi Yan wrote:
> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> 
>> On 09/19/23 16:57, Zi Yan wrote:
>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>
>>>> 	--- a/mm/page_alloc.c
>>>> 	+++ b/mm/page_alloc.c
>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>
>>>>  		/* Do not cross zone boundaries */
>>>> 	+#if 0
>>>>  		if (!zone_spans_pfn(zone, start))
>>>> 			start = zone->zone_start_pfn;
>>>> 	+#else
>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>> 	+		start = pfn;
>>>> 	+#endif
>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>> 	 		return false;
>>>> 	I can still trigger warnings.
>>>
>>> OK. One thing to note is that the page type in the warning changed from
>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>
>>
>> Just to be really clear,
>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>   path WITHOUT your change.
>>
>> I am guessing the difference here has more to do with the allocation path?
>>
>> I went back and reran focusing on the specific migrate type.
>> Without your patch, and coming from the alloc_contig_range call path,
>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>> With your patch I got one 'page type is 0, passed migratetype is 1'
>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>
>> I could be wrong, but I do not think your patch changes things.
> 
> Got it. Thanks for the clarification.
>>
>>>>
>>>> One idea about recreating the issue is that it may have to do with size
>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>> to really stress the allocations by increasing the number of hugetlb
>>>> pages requested and that did not help.  I also noticed that I only seem
>>>> to get two warnings and then they stop, even if I continue to run the
>>>> script.
>>>>
>>>> Zi asked about my config, so it is attached.
>>>
>>> With your config, I still have no luck reproducing the issue. I will keep
>>> trying. Thanks.
>>>
>>
>> Perhaps try running both scripts in parallel?
> 
> Yes. It seems to do the trick.
> 
>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> 
> I am able to reproduce it with the script below:
> 
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>  wait
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> done
> 
> I will look into the issue.

With migratetypes 0 and 1, and a somewhat harder to reproduce scenario (= less
deterministic, more racy), it's possible we are now seeing what I suspected
could happen here:
https://lore.kernel.org/all/37dbd4d0-c125-6694-dec4-6322ae5b6dee@suse.cz/
namely, that there are places reading the migratetype outside of the zone lock.
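
A userspace sketch of that race pattern, with hypothetical names (block_type
stands in for the pageblock's migratetype, zone_lock() for zone->lock; none
of these are kernel APIs). The point is purely ordering: a type sampled
before taking the lock can be stale by the time the page is filed:

```c
#include <assert.h>

/* Mock zone lock: a single-threaded stand-in, just to show ordering. */
static int zone_locked;
static void zone_lock(void)   { zone_locked = 1; }
static void zone_unlock(void) { zone_locked = 0; }

static int block_type;            /* e.g. 0 = UNMOVABLE, 1 = MOVABLE */
static int freelist_of_page = -1; /* which freelist the page ended up on */

/* Racy variant: the type was sampled before taking the lock. */
static void free_page_racy(int stale_type)
{
	zone_lock();
	freelist_of_page = stale_type;  /* may disagree with block_type */
	zone_unlock();
}

/* Fixed variant: look the type up under the lock. */
static void free_page_fixed(void)
{
	zone_lock();
	freelist_of_page = block_type;  /* always current */
	zone_unlock();
}

/* Block stealing changes the type, also under the lock. */
static void steal_block(int new_type)
{
	zone_lock();
	block_type = new_type;
	zone_unlock();
}
```

This is obviously not how the real free paths look; it only isolates why
moving the get_pfnblock_migratetype() call under zone->lock closes the window.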


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-20  6:07                       ` Vlastimil Babka
@ 2023-09-20 13:48                         ` Johannes Weiner
  2023-09-20 16:04                           ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-20 13:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Zi Yan, Mike Kravetz, Andrew Morton, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> On 9/20/23 03:38, Zi Yan wrote:
> > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> > 
> >> On 09/19/23 16:57, Zi Yan wrote:
> >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> >>>
> >>>> 	--- a/mm/page_alloc.c
> >>>> 	+++ b/mm/page_alloc.c
> >>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >>>>  		end = pageblock_end_pfn(pfn) - 1;
> >>>>
> >>>>  		/* Do not cross zone boundaries */
> >>>> 	+#if 0
> >>>>  		if (!zone_spans_pfn(zone, start))
> >>>> 			start = zone->zone_start_pfn;
> >>>> 	+#else
> >>>> 	+	if (!zone_spans_pfn(zone, start))
> >>>> 	+		start = pfn;
> >>>> 	+#endif
> >>>> 	 	if (!zone_spans_pfn(zone, end))
> >>>> 	 		return false;
> >>>> 	I can still trigger warnings.
> >>>
> >>> OK. One thing to note is that the page type in the warning changed from
> >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> >>>
> >>
> >> Just to be really clear,
> >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> >>   path WITHOUT your change.
> >>
> >> I am guessing the difference here has more to do with the allocation path?
> >>
> >> I went back and reran focusing on the specific migrate type.
> >> Without your patch, and coming from the alloc_contig_range call path,
> >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> >> With your patch I got one 'page type is 0, passed migratetype is 1'
> >> warning and one 'page type is 1, passed migratetype is 0' warning.
> >>
> >> I could be wrong, but I do not think your patch changes things.
> > 
> > Got it. Thanks for the clarification.
> >>
> >>>>
> >>>> One idea about recreating the issue is that it may have to do with size
> >>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> >>>> to really stress the allocations by increasing the number of hugetlb
> >>>> pages requested and that did not help.  I also noticed that I only seem
> >>>> to get two warnings and then they stop, even if I continue to run the
> >>>> script.
> >>>>
> >>>> Zi asked about my config, so it is attached.
> >>>
> >>> With your config, I still have no luck reproducing the issue. I will keep
> >>> trying. Thanks.
> >>>
> >>
> >> Perhaps try running both scripts in parallel?
> > 
> > Yes. It seems to do the trick.
> > 
> >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> > 
> > I am able to reproduce it with the script below:
> > 
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> >  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> >  wait
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > done
> > 
> > I will look into the issue.

Nice!

I managed to reproduce it ONCE, triggering it not even a second after
starting the script. But I can't seem to do it twice, even after
several reboots and letting it run for minutes.

> With migratetypes 0 and 1 and somewhat harder to reproduce scenario (= less
> deterministic, more racy) it's possible we now see what I suspected can
> happen here:
> https://lore.kernel.org/all/37dbd4d0-c125-6694-dec4-6322ae5b6dee@suse.cz/
> In that there are places reading the migratetype outside of zone lock.

Good point!

I had already written up a fix for this issue. Still trying to get the
reproducer to work, but attaching the fix below in case somebody with
a working environment beats me to it.

---

From 94f67bfa29a602a66014d079431b224cacbf79e9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Sep 2023 16:23:38 -0400
Subject: [PATCH] mm: page_alloc: close migratetype race between freeing and
 stealing

There are several freeing paths that read the page's migratetype
optimistically before grabbing the zone lock. When this races with
block stealing, those pages go on the wrong freelist.

The paths in question are:
- when freeing >costly orders that aren't THP
- when freeing pages to the buddy upon pcp lock contention
- when freeing pages that are isolated
- when freeing pages initially during boot
- when freeing the remainder in alloc_pages_exact()
- when "accepting" unaccepted VM host memory before first use
- when freeing pages during unpoisoning

None of these paths is so hot that it needs this optimization at the cost
of hampering defrag efforts, especially when contrasted with the fact that
the most common buddy freeing path - free_pcppages_bulk - checks the
migratetype under zone->lock just fine.

In addition, isolated pages need to look up the migratetype under the
lock anyway, which adds branches to the locked section, and results in
a double lookup when the pages are in fact isolated.

Move the lookups into the lock.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 47 +++++++++++++++++------------------------------
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ca999d24a00..d902a8aaa3fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1222,18 +1222,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
-static void free_one_page(struct zone *zone,
-				struct page *page, unsigned long pfn,
-				unsigned int order,
-				int migratetype, fpi_t fpi_flags)
+static void free_one_page(struct zone *zone, struct page *page,
+			  unsigned long pfn, unsigned int order,
+			  fpi_t fpi_flags)
 {
 	unsigned long flags;
+	int migratetype;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	if (unlikely(has_isolate_pageblock(zone) ||
-		is_migrate_isolate(migratetype))) {
-		migratetype = get_pfnblock_migratetype(page, pfn);
-	}
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -1249,18 +1246,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	if (!free_pages_prepare(page, order, fpi_flags))
 		return;
 
-	/*
-	 * Calling get_pfnblock_migratetype() without spin_lock_irqsave() here
-	 * is used to avoid calling get_pfnblock_migratetype() under the lock.
-	 * This will reduce the lock holding time.
-	 */
-	migratetype = get_pfnblock_migratetype(page, pfn);
-
 	spin_lock_irqsave(&zone->lock, flags);
-	if (unlikely(has_isolate_pageblock(zone) ||
-		is_migrate_isolate(migratetype))) {
-		migratetype = get_pfnblock_migratetype(page, pfn);
-	}
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2404,7 +2391,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	struct per_cpu_pages *pcp;
 	struct zone *zone;
 	unsigned long pfn = page_to_pfn(page);
-	int migratetype, pcpmigratetype;
+	int migratetype;
 
 	if (!free_pages_prepare(page, order, FPI_NONE))
 		return;
@@ -2416,23 +2403,23 @@ void free_unref_page(struct page *page, unsigned int order)
 	 * get those areas back if necessary. Otherwise, we may have to free
 	 * excessively into the page allocator
 	 */
-	migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
 			return;
 		}
-		pcpmigratetype = MIGRATE_MOVABLE;
+		migratetype = MIGRATE_MOVABLE;
 	}
 
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
+		free_one_page(zone, page, pfn, order, FPI_NONE);
 	}
 	pcp_trylock_finish(UP_flags);
 }
@@ -2465,7 +2452,7 @@ void free_unref_page_list(struct list_head *list)
 		migratetype = get_pfnblock_migratetype(page, pfn);
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, 0, FPI_NONE);
 			continue;
 		}
 	}
@@ -2498,8 +2485,7 @@ void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, pfn,
-					      0, migratetype, FPI_NONE);
+				free_one_page(zone, page, pfn, 0, FPI_NONE);
 				locked_zone = NULL;
 				continue;
 			}
@@ -6537,13 +6523,14 @@ bool take_page_off_buddy(struct page *page)
 bool put_page_back_buddy(struct page *page)
 {
 	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
-	int migratetype = get_pfnblock_migratetype(page, pfn);
 	bool ret = false;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	if (put_page_testzero(page)) {
+		unsigned long pfn = page_to_pfn(page);
+		int migratetype = get_pfnblock_migratetype(page, pfn);
+
 		ClearPageHWPoisonTakenOff(page);
 		__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
 		if (TestClearPageHWPoison(page)) {
-- 
2.42.0



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-20 13:48                         ` Johannes Weiner
@ 2023-09-20 16:04                           ` Johannes Weiner
  2023-09-20 17:23                             ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-20 16:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Zi Yan, Mike Kravetz, Andrew Morton, Mel Gorman, Miaohe Lin,
	Kefeng Wang, linux-mm, linux-kernel

On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> > On 9/20/23 03:38, Zi Yan wrote:
> > > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> > > 
> > >> On 09/19/23 16:57, Zi Yan wrote:
> > >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> > >>>
> > >>>> 	--- a/mm/page_alloc.c
> > >>>> 	+++ b/mm/page_alloc.c
> > >>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> > >>>>  		end = pageblock_end_pfn(pfn) - 1;
> > >>>>
> > >>>>  		/* Do not cross zone boundaries */
> > >>>> 	+#if 0
> > >>>>  		if (!zone_spans_pfn(zone, start))
> > >>>> 			start = zone->zone_start_pfn;
> > >>>> 	+#else
> > >>>> 	+	if (!zone_spans_pfn(zone, start))
> > >>>> 	+		start = pfn;
> > >>>> 	+#endif
> > >>>> 	 	if (!zone_spans_pfn(zone, end))
> > >>>> 	 		return false;
> > >>>> 	I can still trigger warnings.
> > >>>
> > >>> OK. One thing to note is that the page type in the warning changed from
> > >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> > >>>
> > >>
> > >> Just to be really clear,
> > >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> > >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> > >>   path WITHOUT your change.
> > >>
> > >> I am guessing the difference here has more to do with the allocation path?
> > >>
> > >> I went back and reran focusing on the specific migrate type.
> > >> Without your patch, and coming from the alloc_contig_range call path,
> > >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> > >> With your patch I got one 'page type is 0, passed migratetype is 1'
> > >> warning and one 'page type is 1, passed migratetype is 0' warning.
> > >>
> > >> I could be wrong, but I do not think your patch changes things.
> > > 
> > > Got it. Thanks for the clarification.
> > >>
> > >>>>
> > >>>> One idea about recreating the issue is that it may have to do with size
> > >>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> > >>>> to really stress the allocations by increasing the number of hugetlb
> > >>>> pages requested and that did not help.  I also noticed that I only seem
> > >>>> to get two warnings and then they stop, even if I continue to run the
> > >>>> script.
> > >>>>
> > >>>> Zi asked about my config, so it is attached.
> > >>>
> > >>> With your config, I still have no luck reproducing the issue. I will keep
> > >>> trying. Thanks.
> > >>>
> > >>
> > >> Perhaps try running both scripts in parallel?
> > > 
> > > Yes. It seems to do the trick.
> > > 
> > >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> > > 
> > > I am able to reproduce it with the script below:
> > > 
> > > while true; do
> > >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> > >  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> > >  wait
> > >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > >  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > > done
> > > 
> > > I will look into the issue.
> 
> Nice!
> 
> I managed to reproduce it ONCE, triggering it not even a second after
> starting the script. But I can't seem to do it twice, even after
> several reboots and letting it run for minutes.

I managed to reproduce it reliably by cutting both nr_hugepages
parameters in half.

The one that triggers for me is always MIGRATE_ISOLATE. With some
printk-tracing, the scenario seems to be this:

#0                                                   #1
start_isolate_page_range()
  isolate_single_pageblock()
    set_migratetype_isolate(tail)
      lock zone->lock
      move_freepages_block(tail) // nop
      set_pageblock_migratetype(tail)
      unlock zone->lock
                                                     del_page_from_freelist(head)
                                                     expand(head, head_mt)
                                                       WARN(head_mt != tail_mt)
    start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
    for (pfn = start_pfn, pfn < end_pfn)
      if (PageBuddy())
        split_free_page(head)

IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
lock. The move_freepages_block() does nothing because the PageBuddy()
is set on the pageblock to the left. Once we drop the lock, the buddy
gets allocated and the expand() puts things on the wrong list. The
splitting code that handles MAX_ORDER blocks runs *after* the tail
type is set and the lock has been dropped, so it's too late.
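
The alignment mismatch at the heart of this can be put into numbers. The
constants below assume a common x86-64 config (4k pages, 2MB pageblocks,
order-10 max buddies); they are illustrative, not taken from Mike's setup:

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL   /* assumed: 2MB pageblocks, 4k pages */
#define MAX_ORDER_NR_PAGES 1024UL  /* assumed: order-10 max buddy size */

/* Start pfn of the MAX_ORDER buddy that contains pfn. */
static unsigned long max_order_buddy_start(unsigned long pfn)
{
	return pfn & ~(MAX_ORDER_NR_PAGES - 1);
}
```

So isolating the pageblock at pfn 1536 updates its type, but the free page
it belongs to is headed at pfn 1024 in the neighbouring block - exactly the
nop move_freepages_block() in the scenario above.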

I think this would work fine if we always set MIGRATE_ISOLATE in a
linear fashion, with start and end aligned to MAX_ORDER. Then we also
wouldn't have to split things.

There are two reasons this doesn't happen today:

1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
   this test case they always seem aligned, but it's not
   guaranteed. However,

2. start_isolate_page_range() explicitly breaks ordering by doing the
   last block in the range before the center. It's that last block
   that triggers the race with __rmqueue_smallest -> expand() for me.

With the below patch I can no longer reproduce the issue:

---

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..b7c8730bf0e2 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long pfn;
 	struct page *page;
 	/* isolation is done at page block granularity */
-	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-	unsigned long isolate_end = pageblock_align(end_pfn);
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
 	int ret;
 	bool skip_isolation = false;
 
@@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	if (ret)
 		return ret;
 
-	if (isolate_start == isolate_end - pageblock_nr_pages)
-		skip_isolation = true;
-
-	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
-			skip_isolation, migratetype);
-	if (ret) {
-		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
-		return ret;
-	}
-
 	/* skip isolated pageblocks at the beginning and end */
 	for (pfn = isolate_start + pageblock_nr_pages;
 	     pfn < isolate_end - pageblock_nr_pages;
@@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
 			undo_isolate_page_range(isolate_start, pfn, migratetype);
-			unset_migratetype_isolate(
-				pfn_to_page(isolate_end - pageblock_nr_pages),
-				migratetype);
 			return -EBUSY;
 		}
 	}
+
+	if (isolate_start == isolate_end - pageblock_nr_pages)
+		skip_isolation = true;
+
+	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+			skip_isolation, migratetype);
+	if (ret) {
+		undo_isolate_page_range(isolate_start, pfn, migratetype);
+		return ret;
+	}
+
 	return 0;
 }
 
@@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
-	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-	unsigned long isolate_end = pageblock_align(end_pfn);
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
 
 	for (pfn = isolate_start;
 	     pfn < isolate_end;


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-20 16:04                           ` Johannes Weiner
@ 2023-09-20 17:23                             ` Zi Yan
  2023-09-21  2:31                               ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-20 17:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Mike Kravetz, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel,
	David Hildenbrand


On 20 Sep 2023, at 12:04, Johannes Weiner wrote:

> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>> On 9/20/23 03:38, Zi Yan wrote:
>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>
>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>
>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>
>>>>>>>  		/* Do not cross zone boundaries */
>>>>>>> 	+#if 0
>>>>>>>  		if (!zone_spans_pfn(zone, start))
>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>> 	+#else
>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>> 	+		start = pfn;
>>>>>>> 	+#endif
>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>> 	 		return false;
>>>>>>> 	I can still trigger warnings.
>>>>>>
>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>
>>>>>
>>>>> Just to be really clear,
>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>   path WITHOUT your change.
>>>>>
>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>
>>>>> I went back and reran focusing on the specific migrate type.
>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>
>>>>> I could be wrong, but I do not think your patch changes things.
>>>>
>>>> Got it. Thanks for the clarification.
>>>>>
>>>>>>>
>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>> script.
>>>>>>>
>>>>>>> Zi asked about my config, so it is attached.
>>>>>>
>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>> trying. Thanks.
>>>>>>
>>>>>
>>>>> Perhaps try running both scripts in parallel?
>>>>
>>>> Yes. It seems to do the trick.
>>>>
>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>
>>>> I am able to reproduce it with the script below:
>>>>
>>>> while true; do
>>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>  wait
>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>> done
>>>>
>>>> I will look into the issue.
>>
>> Nice!
>>
>> I managed to reproduce it ONCE, triggering it not even a second after
>> starting the script. But I can't seem to do it twice, even after
>> several reboots and letting it run for minutes.
>
> I managed to reproduce it reliably by cutting the nr_hugepages
> parameters respectively in half.
>
> The one that triggers for me is always MIGRATE_ISOLATE. With some
> printk-tracing, the scenario seems to be this:
>
> #0                                                   #1
> start_isolate_page_range()
>   isolate_single_pageblock()
>     set_migratetype_isolate(tail)
>       lock zone->lock
>       move_freepages_block(tail) // nop
>       set_pageblock_migratetype(tail)
>       unlock zone->lock
>                                                      del_page_from_freelist(head)
>                                                      expand(head, head_mt)
>                                                        WARN(head_mt != tail_mt)
>     start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>     for (pfn = start_pfn, pfn < end_pfn)
>       if (PageBuddy())
>         split_free_page(head)
>
> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
> lock. The move_freepages_block() does nothing because the PageBuddy()
> is set on the pageblock to the left. Once we drop the lock, the buddy
> gets allocated and the expand() puts things on the wrong list. The
> splitting code that handles MAX_ORDER blocks runs *after* the tail
> type is set and the lock has been dropped, so it's too late.

Yes, I can confirm this is the issue as well. But the current behavior is
intentional: it enables allocating a contiguous range at pageblock granularity
instead of MAX_ORDER granularity. With your change below, that no longer
works: if there is an unmovable page in
[ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
the allocation fails, whereas it would succeed with the current implementation.

I think a proper fix would be to make move_freepages_block() split the
MAX_ORDER page and put the split pages in the right migratetype free lists.

I am working on that.
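
For what it's worth, the splitting step described above might look roughly
like the sketch below - a userspace mock with hypothetical names
(split_for_isolation, struct chunk), loosely modeled on how expand() halves
a buddy: keep halving the free page until the target pageblock stands alone,
recording each peeled-off sibling at its own start pfn and order, where it
can then be filed under its own block's migratetype:

```c
#include <assert.h>

#define PAGEBLOCK_ORDER 9   /* assumed: 512-page (2MB) pageblocks */
#define MAX_ORDER       10

/* Record of split results: start pfn and order of each free chunk. */
struct chunk { unsigned long pfn; unsigned int order; };

/*
 * Split a free page of the given order so that the pageblock containing
 * target_pfn becomes a standalone order-PAGEBLOCK_ORDER free page.
 * Returns the number of chunks written to out[].
 */
static int split_for_isolation(unsigned long pfn, unsigned int order,
			       unsigned long target_pfn, struct chunk *out)
{
	int n = 0;

	while (order > PAGEBLOCK_ORDER) {
		unsigned long half = 1UL << --order;

		if (target_pfn < pfn + half) {
			/* Target is in the low half: peel off the high half. */
			out[n++] = (struct chunk){ pfn + half, order };
		} else {
			/* Target is in the high half: peel off the low half. */
			out[n++] = (struct chunk){ pfn, order };
			pfn += half;
		}
	}
	out[n++] = (struct chunk){ pfn, order };  /* the target pageblock */
	return n;
}
```

The real fix would of course have to hook into the freelist bookkeeping
under zone->lock; this only shows the split geometry.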

>
> I think this would work fine if we always set MIGRATE_ISOLATE in a
> linear fashion, with start and end aligned to MAX_ORDER. Then we also
> wouldn't have to split things.
>
> There are two reasons this doesn't happen today:
>
> 1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
>    this test case they always seem aligned, but it's not
>    guaranteed. However,
>
> 2. start_isolate_page_range() explicitly breaks ordering by doing the
>    last block in the range before the center. It's that last block
>    that triggers the race with __rmqueue_smallest -> expand() for me.
>
> With the below patch I can no longer reproduce the issue:
>
> ---
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b5c7a9d21257..b7c8730bf0e2 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	unsigned long pfn;
>  	struct page *page;
>  	/* isolation is done at page block granularity */
> -	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
> -	unsigned long isolate_end = pageblock_align(end_pfn);
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
> +	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
>  	int ret;
>  	bool skip_isolation = false;
>
> @@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	if (ret)
>  		return ret;
>
> -	if (isolate_start == isolate_end - pageblock_nr_pages)
> -		skip_isolation = true;
> -
> -	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> -	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> -			skip_isolation, migratetype);
> -	if (ret) {
> -		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
> -		return ret;
> -	}
> -
>  	/* skip isolated pageblocks at the beginning and end */
>  	for (pfn = isolate_start + pageblock_nr_pages;
>  	     pfn < isolate_end - pageblock_nr_pages;
> @@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  		if (page && set_migratetype_isolate(page, migratetype, flags,
>  					start_pfn, end_pfn)) {
>  			undo_isolate_page_range(isolate_start, pfn, migratetype);
> -			unset_migratetype_isolate(
> -				pfn_to_page(isolate_end - pageblock_nr_pages),
> -				migratetype);
>  			return -EBUSY;
>  		}
>  	}
> +
> +	if (isolate_start == isolate_end - pageblock_nr_pages)
> +		skip_isolation = true;
> +
> +	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> +	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> +			skip_isolation, migratetype);
> +	if (ret) {
> +		undo_isolate_page_range(isolate_start, pfn, migratetype);
> +		return ret;
> +	}
> +
>  	return 0;
>  }
>
> @@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	unsigned long pfn;
>  	struct page *page;
> -	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
> -	unsigned long isolate_end = pageblock_align(end_pfn);
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
> +	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
>
>  	for (pfn = isolate_start;
>  	     pfn < isolate_end;


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-20 17:23                             ` Zi Yan
@ 2023-09-21  2:31                               ` Zi Yan
  2023-09-21 10:19                                 ` David Hildenbrand
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-21  2:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Mike Kravetz, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel,
	David Hildenbrand

[-- Attachment #1: Type: text/plain, Size: 7159 bytes --]

On 20 Sep 2023, at 13:23, Zi Yan wrote:

> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>
>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>
>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>
>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>
>>>>>>>>  		/* Do not cross zone boundaries */
>>>>>>>> 	+#if 0
>>>>>>>>  		if (!zone_spans_pfn(zone, start))
>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>> 	+#else
>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>> 	+		start = pfn;
>>>>>>>> 	+#endif
>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>> 	 		return false;
>>>>>>>> 	I can still trigger warnings.
>>>>>>>
>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>
>>>>>>
>>>>>> Just to be really clear,
>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>   path WITHOUT your change.
>>>>>>
>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>
>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>
>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>
>>>>> Got it. Thanks for the clarification.
>>>>>>
>>>>>>>>
>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>> script.
>>>>>>>>
>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>
>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>> trying. Thanks.
>>>>>>>
>>>>>>
>>>>>> Perhaps try running both scripts in parallel?
>>>>>
>>>>> Yes. It seems to do the trick.
>>>>>
>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>
>>>>> I am able to reproduce it with the script below:
>>>>>
>>>>> while true; do
>>>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>  wait
>>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>> done
>>>>>
>>>>> I will look into the issue.
>>>
>>> Nice!
>>>
>>> I managed to reproduce it ONCE, triggering it not even a second after
>>> starting the script. But I can't seem to do it twice, even after
>>> several reboots and letting it run for minutes.
>>
>> I managed to reproduce it reliably by cutting the nr_hugepages
>> parameters respectively in half.
>>
>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>> printk-tracing, the scenario seems to be this:
>>
>> #0                                                   #1
>> start_isolate_page_range()
>>   isolate_single_pageblock()
>>     set_migratetype_isolate(tail)
>>       lock zone->lock
>>       move_freepages_block(tail) // nop
>>       set_pageblock_migratetype(tail)
>>       unlock zone->lock
>>                                                      del_page_from_freelist(head)
>>                                                      expand(head, head_mt)
>>                                                        WARN(head_mt != tail_mt)
>>     start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>     for (pfn = start_pfn, pfn < end_pfn)
>>       if (PageBuddy())
>>         split_free_page(head)
>>
>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>> lock. The move_freepages_block() does nothing because the PageBuddy()
>> is set on the pageblock to the left. Once we drop the lock, the buddy
>> gets allocated and the expand() puts things on the wrong list. The
>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>> type is set and the lock has been dropped, so it's too late.
>
> Yes, this is the issue I can confirm as well. But it is intentional to enable
> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
> granularity. With your changes below, it no longer works, because if there
> is an unmovable page in
> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
> the allocation fails but it would succeed in current implementation.
>
> I think a proper fix would be to make move_freepages_block() split the
> MAX_ORDER page and put the split pages in the right migratetype free lists.
>
> I am working on that.

After spending half a day on this, I think getting alloc_contig_range() to work
with the freelist migratetype hygiene patchset is much harder than I thought,
because alloc_contig_range() relies on racy migratetype changes:

1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
free list yet.

2. later in the process, isolate_freepages_range() is used to actually grab
the free pages.

3. there was no problem when alloc_contig_range() worked on MAX_ORDER-aligned
ranges, since MIGRATE_ISOLATE could not be set in the middle of free pages or
in-use pages. But that is not the case when alloc_contig_range() works on
pageblock-aligned ranges: during the isolation phase, free or in-use pages
now need to be split to get their subpages onto the right free lists.

4. the hardest case is when an in-use page sits across two pageblocks. Currently,
the code just isolates one pageblock, migrates the page, and lets split_free_page()
correct the free list later. But to strictly enforce freelist migratetype
hygiene, extra work is needed in the page free path to split the freed page onto
the right free lists.

I need more time to think about how to get alloc_contig_range() working properly.
Help is needed for bullet point 4.

Thanks.

PS: One observation is that after move_to_free_list(), a page's migratetype
does not match the migratetype of its free list. I might need to make
changes on top of your patchset to get alloc_contig_range() working.


--
Best Regards,
Yan, Zi



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-21  2:31                               ` Zi Yan
@ 2023-09-21 10:19                                 ` David Hildenbrand
  2023-09-21 14:47                                   ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: David Hildenbrand @ 2023-09-21 10:19 UTC (permalink / raw)
  To: Zi Yan, Johannes Weiner
  Cc: Vlastimil Babka, Mike Kravetz, Andrew Morton, Mel Gorman,
	Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On 21.09.23 04:31, Zi Yan wrote:
> On 20 Sep 2023, at 13:23, Zi Yan wrote:
> 
>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>
>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>
>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>
>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>
>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>> 	+#if 0
>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>> 	+#else
>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>> 	+		start = pfn;
>>>>>>>>> 	+#endif
>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>> 	 		return false;
>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>
>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>
>>>>>>>
>>>>>>> Just to be really clear,
>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>    path WITHOUT your change.
>>>>>>>
>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>
>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>
>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>
>>>>>> Got it. Thanks for the clarification.
>>>>>>>
>>>>>>>>>
>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>> script.
>>>>>>>>>
>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>
>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>> trying. Thanks.
>>>>>>>>
>>>>>>>
>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>
>>>>>> Yes. It seems to do the trick.
>>>>>>
>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>
>>>>>> I am able to reproduce it with the script below:
>>>>>>
>>>>>> while true; do
>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>   wait
>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>> done
>>>>>>
>>>>>> I will look into the issue.
>>>>
>>>> Nice!
>>>>
>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>> starting the script. But I can't seem to do it twice, even after
>>>> several reboots and letting it run for minutes.
>>>
>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>> parameters respectively in half.
>>>
>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>> printk-tracing, the scenario seems to be this:
>>>
>>> #0                                                   #1
>>> start_isolate_page_range()
>>>    isolate_single_pageblock()
>>>      set_migratetype_isolate(tail)
>>>        lock zone->lock
>>>        move_freepages_block(tail) // nop
>>>        set_pageblock_migratetype(tail)
>>>        unlock zone->lock
>>>                                                       del_page_from_freelist(head)
>>>                                                       expand(head, head_mt)
>>>                                                         WARN(head_mt != tail_mt)
>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>        if (PageBuddy())
>>>          split_free_page(head)
>>>
>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>> gets allocated and the expand() puts things on the wrong list. The
>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>> type is set and the lock has been dropped, so it's too late.
>>
>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>> granularity. With your changes below, it no longer works, because if there
>> is an unmovable page in
>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>> the allocation fails but it would succeed in current implementation.
>>
>> I think a proper fix would be to make move_freepages_block() split the
>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>
>> I am working on that.
> 
> After spending half a day on this, I think it is much harder than I thought
> to get alloc_contig_range() working with the freelist migratetype hygiene
> patchset. Because alloc_contig_range() relies on racy migratetype changes:
> 
> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
> free list yet.
> 
> 2. later in the process, isolate_freepages_range() is used to actually grab
> the free pages.
> 
> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
> in-use pages. But it is not the case when alloc_contig_range() work on
> pageblock aligned ranges. Now during isolation phase, free or in-use pages
> will need to be split to get their subpages into the right free lists.
> 
> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
> the code just isolate one pageblock, migrate the page, and let split_free_page()
> to correct the free list later. But to strictly enforce freelist migratetype
> hygiene, extra work is needed at free page path to split the free page into
> the right freelists.
> 
> I need more time to think about how to get alloc_contig_range() properly.
> Help is needed for the bullet point 4.


I once raised that we should maybe try making MIGRATE_ISOLATE a flag 
that preserves the original migratetype. Not sure if that would help 
here in any way.

The whole alloc_contig_range() implementation is quite complicated and 
hard to grasp. If we could find ways to clean all that up and make it 
easier to understand and play along, that would be nice.

-- 
Cheers,

David / dhildenb



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-21 10:19                                 ` David Hildenbrand
@ 2023-09-21 14:47                                   ` Zi Yan
  2023-09-25 21:12                                     ` Zi Yan
  2023-09-26 18:19                                     ` David Hildenbrand
  0 siblings, 2 replies; 83+ messages in thread
From: Zi Yan @ 2023-09-21 14:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


On 21 Sep 2023, at 6:19, David Hildenbrand wrote:

> On 21.09.23 04:31, Zi Yan wrote:
>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>
>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>
>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>
>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>
>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>
>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>> 	+#if 0
>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>> 	+#else
>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>> 	+#endif
>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>> 	 		return false;
>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>
>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Just to be really clear,
>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>    path WITHOUT your change.
>>>>>>>>
>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>
>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>
>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>
>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>> script.
>>>>>>>>>>
>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>
>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>> trying. Thanks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>
>>>>>>> Yes. It seems to do the trick.
>>>>>>>
>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>
>>>>>>> I am able to reproduce it with the script below:
>>>>>>>
>>>>>>> while true; do
>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>   wait
>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>> done
>>>>>>>
>>>>>>> I will look into the issue.
>>>>>
>>>>> Nice!
>>>>>
>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>> starting the script. But I can't seem to do it twice, even after
>>>>> several reboots and letting it run for minutes.
>>>>
>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>> parameters respectively in half.
>>>>
>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>> printk-tracing, the scenario seems to be this:
>>>>
>>>> #0                                                   #1
>>>> start_isolate_page_range()
>>>>    isolate_single_pageblock()
>>>>      set_migratetype_isolate(tail)
>>>>        lock zone->lock
>>>>        move_freepages_block(tail) // nop
>>>>        set_pageblock_migratetype(tail)
>>>>        unlock zone->lock
>>>>                                                       del_page_from_freelist(head)
>>>>                                                       expand(head, head_mt)
>>>>                                                         WARN(head_mt != tail_mt)
>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>        if (PageBuddy())
>>>>          split_free_page(head)
>>>>
>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>> gets allocated and the expand() puts things on the wrong list. The
>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>> type is set and the lock has been dropped, so it's too late.
>>>
>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>> granularity. With your changes below, it no longer works, because if there
>>> is an unmovable page in
>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>> the allocation fails but it would succeed in current implementation.
>>>
>>> I think a proper fix would be to make move_freepages_block() split the
>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>
>>> I am working on that.
>>
>> After spending half a day on this, I think it is much harder than I thought
>> to get alloc_contig_range() working with the freelist migratetype hygiene
>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>
>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>> free list yet.
>>
>> 2. later in the process, isolate_freepages_range() is used to actually grab
>> the free pages.
>>
>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>> in-use pages. But it is not the case when alloc_contig_range() work on
>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>> will need to be split to get their subpages into the right free lists.
>>
>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>> to correct the free list later. But to strictly enforce freelist migratetype
>> hygiene, extra work is needed at free page path to split the free page into
>> the right freelists.
>>
>> I need more time to think about how to get alloc_contig_range() properly.
>> Help is needed for the bullet point 4.
>
>
> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.

That has been in my backlog since you asked, and I have been delaying it. ;) Hopefully
I can do it after I fix this. That change might or might not help, and only if we
redesign how migratetype is managed to some extent. If MIGRATE_ISOLATE did not
overwrite the existing migratetype, the code might not need to split a page and move
it to the MIGRATE_ISOLATE free list?

The fundamental issue in alloc_contig_range() is that, to work at pageblock
granularity, a page (>pageblock_order) can have one part isolated while the rest
is a different migratetype. {add_to,move_to,del_page_from}_free_list()
now check the migratetype of the first pageblock, so such a page needs to be removed
from its free list, have MIGRATE_ISOLATE set on one of its pageblocks, be split, and
finally be put back onto multiple free lists. This needs to be done at the isolation
stage, before free pages are removed from their free lists (the stage after isolation).
If MIGRATE_ISOLATE were a separate flag and we were OK with leaving isolated pages
in their original migratetype and checking the migratetype before allocating a page,
that might help. But it might add extra work (e.g., splitting a partially
isolated free page before allocation) to a really hot code path, which is not
desirable.

>
> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.

I will try my best to simplify it.

--
Best Regards,
Yan, Zi



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-21 14:47                                   ` Zi Yan
@ 2023-09-25 21:12                                     ` Zi Yan
  2023-09-26 17:39                                       ` Johannes Weiner
  2023-09-26 18:19                                     ` David Hildenbrand
  1 sibling, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-25 21:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel



On 21 Sep 2023, at 10:47, Zi Yan wrote:

> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>
>> On 21.09.23 04:31, Zi Yan wrote:
>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>
>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>
>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>
>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>
>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>
>>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>> 	+#else
>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>> 	+#endif
>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>> 	 		return false;
>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>
>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just to be really clear,
>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>    path WITHOUT your change.
>>>>>>>>>
>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>
>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>
>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>
>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>> script.
>>>>>>>>>>>
>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>
>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>> trying. Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>
>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>
>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>
>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>
>>>>>>>> while true; do
>>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>   wait
>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>> done
>>>>>>>>
>>>>>>>> I will look into the issue.
>>>>>>
>>>>>> Nice!
>>>>>>
>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>> several reboots and letting it run for minutes.
>>>>>
>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>> parameters respectively in half.
>>>>>
>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>> printk-tracing, the scenario seems to be this:
>>>>>
>>>>> #0                                                   #1
>>>>> start_isolate_page_range()
>>>>>    isolate_single_pageblock()
>>>>>      set_migratetype_isolate(tail)
>>>>>        lock zone->lock
>>>>>        move_freepages_block(tail) // nop
>>>>>        set_pageblock_migratetype(tail)
>>>>>        unlock zone->lock
>>>>>                                                       del_page_from_freelist(head)
>>>>>                                                       expand(head, head_mt)
>>>>>                                                         WARN(head_mt != tail_mt)
>>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>>        if (PageBuddy())
>>>>>          split_free_page(head)
>>>>>
>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>> type is set and the lock has been dropped, so it's too late.
>>>>
>>>> Yes, this is the issue I can confirm as well. But it is intentional, to
>>>> enable allocating a contiguous range at pageblock granularity instead of
>>>> MAX_ORDER granularity. With your changes below, it no longer works, because
>>>> if there is an unmovable page in
>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>> the allocation fails, whereas it would succeed in the current implementation.
>>>>
>>>> I think a proper fix would be to make move_freepages_block() split the
>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>
>>>> I am working on that.
>>>
>>> After spending half a day on this, I think it is much harder than I thought
>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>
>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>> free list yet.
>>>
>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>> the free pages.
>>>
>>> 3. there was no problem when alloc_contig_range() worked on MAX_ORDER aligned
>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>> in-use pages. But that is not the case when alloc_contig_range() works on
>>> pageblock aligned ranges. Now, during the isolation phase, free or in-use
>>> pages will need to be split to get their subpages into the right free lists.
>>>
>>> 4. the hardest case is when an in-use page sits across two pageblocks. Currently,
>>> the code just isolates one pageblock, migrates the page, and lets split_free_page()
>>> correct the free list later. But to strictly enforce freelist migratetype
>>> hygiene, extra work is needed in the free page path to split the free page into
>>> the right freelists.
>>>
>>> I need more time to think about how to get alloc_contig_range() working properly.
>>> Help is needed for bullet point 4.
>>
>>
>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>
> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
> I can do it after I fix this. That change might or might not help; it could
> only help if we redesign how the migratetype is managed. If MIGRATE_ISOLATE did
> not overwrite the existing migratetype, the code might not need to split a page
> and move it to the MIGRATE_ISOLATE freelist?
>
> The fundamental issue in alloc_contig_range() is that, to work at
> pageblock level, a page (>pageblock_order) can have one part isolated while
> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
> now checks the first pageblock's migratetype, so such a page needs to be removed
> from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split,
> and finally be put back on multiple free lists. This needs to be done at the
> isolation stage, before free pages are removed from their free lists (the stage
> after isolation). If MIGRATE_ISOLATE were a separate flag and we were OK with
> leaving isolated pages in their original migratetype and checking the migratetype
> before allocating a page, that might help. But that might add extra work (e.g.,
> splitting a partially isolated free page before allocation) in the really hot
> code path, which is not desirable.
>
>>
>> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
>
> I will try my best to simplify it.

Hi Johannes,

I attached three patches to fix the issue; the first two can be folded into
your patchset:

1. the __free_one_page() bug you and Vlastimil discussed in the other email.
2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
3. enable move_freepages() to split a free page that is only partially covered
   by the given [start_pfn, end_pfn] range, and set the migratetype correctly
   when a >pageblock_order free page is moved. Previously, when a >pageblock_order
   free page was moved, only the first pageblock's migratetype was changed; the
   added WARN_ON_ONCE might be triggered by these pages.

I ran Mike's test with transhuge-stress together with my patches on top of your
"close migratetype race" patch for more than an hour without any warning.
It should unblock your patchset. I will keep working on alloc_contig_range()
simplification.


--
Best Regards,
Yan, Zi

[-- Attachment #1.2: 0001-mm-fix-__free_one_page.patch --]
[-- Type: text/plain, Size: 1400 bytes --]

From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Fri, 22 Sep 2023 11:11:32 -0400
Subject: [PATCH 1/3] mm: fix __free_one_page().

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7de022bc4c7d..72f27d14c8e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
 	while (order < MAX_ORDER) {
-		int buddy_mt;
-
 		if (compaction_capture(capc, page, order, migratetype))
 			return;
 
@@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
 		if (!buddy)
 			goto done_merging;
 
-		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
-
 		if (unlikely(order >= pageblock_order)) {
 			/*
 			 * We want to prevent merge between freepages on pageblock
@@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
 		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order);
 		else
-			del_page_from_free_list(buddy, zone, order, buddy_mt);
+			del_page_from_free_list(buddy, zone, order, migratetype);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
-- 
2.40.1


[-- Attachment #1.3: 0002-mm-set-migratetype-after-free-pages-are-moved-betwee.patch --]
[-- Type: text/plain, Size: 3287 bytes --]

From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:27:14 -0400
Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
 free lists.

This avoids changing the migratetype after move_freepages() or
move_freepages_block(), which is error prone. It also prepares for upcoming
changes that fix move_freepages() not moving free pages that are only
partially covered by the given range.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c     | 10 +++-------
 mm/page_isolation.c |  2 --
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 72f27d14c8e7..7c41cb5d8a36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1618,6 +1618,7 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		pfn += 1 << order;
 		pages_moved += 1 << order;
 	}
+	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
 
 	return pages_moved;
 }
@@ -1839,7 +1840,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
 		move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
-		set_pageblock_migratetype(page, start_type);
 		block_type = start_type;
 	}
 
@@ -1911,7 +1911,6 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	if (migratetype_is_mergeable(mt)) {
 		if (move_freepages_block(zone, page,
 					 mt, MIGRATE_HIGHATOMIC) != -1) {
-			set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 			zone->nr_reserved_highatomic += pageblock_nr_pages;
 		}
 	}
@@ -1996,7 +1995,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
-			set_pageblock_migratetype(page, ac->migratetype);
 			if (ret > 0) {
 				spin_unlock_irqrestore(&zone->lock, flags);
 				return ret;
@@ -2608,10 +2606,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 			 * Only change normal pageblocks (i.e., they can merge
 			 * with others)
 			 */
-			if (migratetype_is_mergeable(mt) &&
-			    move_freepages_block(zone, page, mt,
-						 MIGRATE_MOVABLE) != -1)
-				set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+			if (migratetype_is_mergeable(mt))
+			    move_freepages_block(zone, page, mt, MIGRATE_MOVABLE);
 		}
 	}
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..ee7818ff4e12 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -187,7 +187,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
-		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -262,7 +261,6 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 		 */
 		WARN_ON_ONCE(nr_pages == -1);
 	}
-	set_pageblock_migratetype(page, migratetype);
 	if (isolated_page)
 		__putback_isolated_page(page, order, migratetype);
 	zone->nr_isolate_pageblock--;
-- 
2.40.1


[-- Attachment #1.4: 0003-mm-enable-move_freepages-to-properly-move-part-of-fr.patch --]
[-- Type: text/plain, Size: 6916 bytes --]

From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
 pages.

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages only partially covered by the specified range, leaving a
race window open[1]. Fix it by teaching move_freepages() to split a free page
when only part of it is going to be moved.

In addition, when a >pageblock_order free page is moved, only its first
pageblock's migratetype is changed, which can cause warnings later. Fix it by
setting all pageblocks in a free page to the same migratetype after the move.

split_free_page() is changed to be usable from move_freepages() and
isolate_single_pageblock(). The common code to find the start pfn of a free
page is refactored into get_freepage_start_pfn().

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
 mm/page_isolation.c | 17 +++++++---
 2 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7c41cb5d8a36..3fd5ab40b55c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
-	unsigned long flags;
 	int free_page_order;
 	int mt;
 	int ret = 0;
 
-	if (split_pfn_offset == 0)
-		return ret;
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+		return ret;
 
 	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
 		ret = -ENOENT;
@@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
 /*
@@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+/*
+ * Get the first pfn of the free page that the given pfn belongs to. If
+ * there is no such free page, return the given pfn.
+ */
+static unsigned long get_freepage_start_pfn(unsigned long pfn)
+{
+	int order = 0;
+	unsigned long start_pfn = pfn;
+
+	while (!PageBuddy(pfn_to_page(start_pfn))) {
+		if (++order > MAX_ORDER) {
+			start_pfn = pfn;
+			break;
+		}
+		start_pfn &= ~0UL << order;
+	}
+	return start_pfn;
+}
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 			  unsigned long end_pfn, int old_mt, int new_mt)
 {
 	struct page *page;
-	unsigned long pfn;
+	unsigned long pfn, pfn2;
 	unsigned int order;
 	int pages_moved = 0;
+	unsigned long mt_change_pfn = start_pfn;
+	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
+
+	/* split at start_pfn if it is in the middle of a free page */
+	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
+		struct page *new_page = pfn_to_page(new_start_pfn);
+		int new_page_order = buddy_order(new_page);
+
+		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
+			/* change migratetype so that split_free_page can work */
+			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+			split_free_page(new_page, buddy_order(new_page),
+					start_pfn - new_start_pfn);
+
+			mt_change_pfn = start_pfn;
+			/* move to next page */
+			start_pfn = new_start_pfn + (1 << new_page_order);
+		}
+	}
+
 
 	for (pfn = start_pfn; pfn <= end_pfn;) {
 		page = pfn_to_page(pfn);
@@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 
 		order = buddy_order(page);
 		move_to_free_list(page, zone, order, old_mt, new_mt);
+		/*
+		 * Set the migratetype for all pageblocks within the page, but
+		 * only after all free pages in the pageblock have been moved.
+		 */
+		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
+			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
+			     pfn2 += pageblock_nr_pages) {
+				set_pageblock_migratetype(pfn_to_page(pfn2),
+							  new_mt);
+				mt_change_pfn = pfn2;
+			}
+		}
 		pfn += 1 << order;
 		pages_moved += 1 << order;
 	}
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+	/* set migratetype for the remaining pageblocks */
+	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
+		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
 
 	return pages_moved;
 }
@@ -6214,14 +6266,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */
 
 	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
+	outer_start = get_freepage_start_pfn(start);
 
 	if (outer_start != start) {
 		order = buddy_order(pfn_to_page(outer_start));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ee7818ff4e12..b5f90ae03190 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			int order = buddy_order(page);
 
 			if (pfn + (1UL << order) > boundary_pfn) {
+				int res;
+				unsigned long flags;
+
+				spin_lock_irqsave(&zone->lock, flags);
+				res = split_free_page(page, order, boundary_pfn - pfn);
+				spin_unlock_irqrestore(&zone->lock, flags);
+
 				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
+				if (res)
 					continue;
 			}
 
@@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				/*
 				 * XXX: mark the page as MIGRATE_ISOLATE so that
 				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
+				 * The page should be freed into separate migratetype
+				 * free lists, unless the free page order is greater
+				 * than pageblock order. That is not the case now,
+				 * since gigantic hugetlb is freed as order-0
+				 * pages and LRU pages do not cross pageblocks.
 				 */
 				if (isolate_page) {
 					ret = set_migratetype_isolate(page, page_mt,
-- 
2.40.1


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-25 21:12                                     ` Zi Yan
@ 2023-09-26 17:39                                       ` Johannes Weiner
  2023-09-28  2:51                                         ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-26 17:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Mon, Sep 25, 2023 at 05:12:38PM -0400, Zi Yan wrote:
> On 21 Sep 2023, at 10:47, Zi Yan wrote:
> 
> > On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
> >
> >> On 21.09.23 04:31, Zi Yan wrote:
> >>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
> >>>
> >>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
> >>>>
> >>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
> >>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> >>>>>>> On 9/20/23 03:38, Zi Yan wrote:
> >>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> >>>>>>>>
> >>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
> >>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> >>>>>>>>>>
> >>>>>>>>>>> 	--- a/mm/page_alloc.c
> >>>>>>>>>>> 	+++ b/mm/page_alloc.c
> >>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
> >>>>>>>>>>>
> >>>>>>>>>>>   		/* Do not cross zone boundaries */
> >>>>>>>>>>> 	+#if 0
> >>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
> >>>>>>>>>>> 			start = zone->zone_start_pfn;
> >>>>>>>>>>> 	+#else
> >>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
> >>>>>>>>>>> 	+		start = pfn;
> >>>>>>>>>>> 	+#endif
> >>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
> >>>>>>>>>>> 	 		return false;
> >>>>>>>>>>> 	I can still trigger warnings.
> >>>>>>>>>>
> >>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
> >>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Just to be really clear,
> >>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> >>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> >>>>>>>>>    path WITHOUT your change.
> >>>>>>>>>
> >>>>>>>>> I am guessing the difference here has more to do with the allocation path?
> >>>>>>>>>
> >>>>>>>>> I went back and reran focusing on the specific migrate type.
> >>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
> >>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> >>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
> >>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
> >>>>>>>>>
> >>>>>>>>> I could be wrong, but I do not think your patch changes things.
> >>>>>>>>
> >>>>>>>> Got it. Thanks for the clarification.
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> One idea about recreating the issue is that it may have to do with the size
> >>>>>>>>>>> of my VM (16G) and the requested allocation size of 4G.  However, I tried
> >>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
> >>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
> >>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
> >>>>>>>>>>> script.
> >>>>>>>>>>>
> >>>>>>>>>>> Zi asked about my config, so it is attached.
> >>>>>>>>>>
> >>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
> >>>>>>>>>> trying. Thanks.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Perhaps try running both scripts in parallel?
> >>>>>>>>
> >>>>>>>> Yes. It seems to do the trick.
> >>>>>>>>
> >>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> >>>>>>>>
> >>>>>>>> I am able to reproduce it with the script below:
> >>>>>>>>
> >>>>>>>> while true; do
> >>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> >>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> >>>>>>>>   wait
> >>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >>>>>>>> done
> >>>>>>>>
> >>>>>>>> I will look into the issue.
> >>>>>>
> >>>>>> Nice!
> >>>>>>
> >>>>>> I managed to reproduce it ONCE, triggering it not even a second after
> >>>>>> starting the script. But I can't seem to do it twice, even after
> >>>>>> several reboots and letting it run for minutes.
> >>>>>
> >>>>> I managed to reproduce it reliably by cutting the nr_hugepages
> >>>>> parameters respectively in half.
> >>>>>
> >>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
> >>>>> printk-tracing, the scenario seems to be this:
> >>>>>
> >>>>> #0                                                   #1
> >>>>> start_isolate_page_range()
> >>>>>    isolate_single_pageblock()
> >>>>>      set_migratetype_isolate(tail)
> >>>>>        lock zone->lock
> >>>>>        move_freepages_block(tail) // nop
> >>>>>        set_pageblock_migratetype(tail)
> >>>>>        unlock zone->lock
> >>>>>                                                       del_page_from_freelist(head)
> >>>>>                                                       expand(head, head_mt)
> >>>>>                                                         WARN(head_mt != tail_mt)
> >>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
> >>>>>      for (pfn = start_pfn, pfn < end_pfn)
> >>>>>        if (PageBuddy())
> >>>>>          split_free_page(head)
> >>>>>
> >>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
> >>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
> >>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
> >>>>> gets allocated and the expand() puts things on the wrong list. The
> >>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
> >>>>> type is set and the lock has been dropped, so it's too late.
> >>>>
> >>>> Yes, this is the issue I can confirm as well. But it is intentional, to
> >>>> enable allocating a contiguous range at pageblock granularity instead of
> >>>> MAX_ORDER granularity. With your changes below, it no longer works, because
> >>>> if there is an unmovable page in
> >>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
> >>>> the allocation fails, whereas it would succeed in the current implementation.
> >>>>
> >>>> I think a proper fix would be to make move_freepages_block() split the
> >>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
> >>>>
> >>>> I am working on that.
> >>>
> >>> After spending half a day on this, I think it is much harder than I thought
> >>> to get alloc_contig_range() working with the freelist migratetype hygiene
> >>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
> >>>
> >>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
> >>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
> >>> free list yet.
> >>>
> >>> 2. later in the process, isolate_freepages_range() is used to actually grab
> >>> the free pages.
> >>>
> >>> 3. there was no problem when alloc_contig_range() worked on MAX_ORDER aligned
> >>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
> >>> in-use pages. But that is not the case when alloc_contig_range() works on
> >>> pageblock aligned ranges. Now, during the isolation phase, free or in-use
> >>> pages will need to be split to get their subpages into the right free lists.
> >>>
> >>> 4. the hardest case is when an in-use page sits across two pageblocks. Currently,
> >>> the code just isolates one pageblock, migrates the page, and lets split_free_page()
> >>> correct the free list later. But to strictly enforce freelist migratetype
> >>> hygiene, extra work is needed in the free page path to split the free page into
> >>> the right freelists.
> >>>
> >>> I need more time to think about how to get alloc_contig_range() working properly.
> >>> Help is needed for bullet point 4.
> >>
> >>
> >> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
> >
> > I have that in my backlog since you asked and have been delaying it. ;) Hopefully
> > I can do it after I fix this. That change might or might not help; it could
> > only help if we redesign how the migratetype is managed. If MIGRATE_ISOLATE did
> > not overwrite the existing migratetype, the code might not need to split a page
> > and move it to the MIGRATE_ISOLATE freelist?
> >
> > The fundamental issue in alloc_contig_range() is that, to work at
> > pageblock level, a page (>pageblock_order) can have one part isolated while
> > the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
> > now checks the first pageblock's migratetype, so such a page needs to be removed
> > from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split,
> > and finally be put back on multiple free lists. This needs to be done at the
> > isolation stage, before free pages are removed from their free lists (the stage
> > after isolation). If MIGRATE_ISOLATE were a separate flag and we were OK with
> > leaving isolated pages in their original migratetype and checking the migratetype
> > before allocating a page, that might help. But that might add extra work (e.g.,
> > splitting a partially isolated free page before allocation) in the really hot
> > code path, which is not desirable.
> >
> >>
> >> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
> >
> > I will try my best to simplify it.
> 
> Hi Johannes,
> 
> I attached three patches to fix the issue; the first two can be folded into
> your patchset:

Hi Zi, thanks for providing these patches! I'll pick them up into the
series.

> 1. the __free_one_page() bug you and Vlastimil discussed in the other email.
> 2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
> 3. enable move_freepages() to split a free page that is only partially covered
>    by the given [start_pfn, end_pfn] range, and set the migratetype correctly
>    when a >pageblock_order free page is moved. Previously, when a >pageblock_order
>    free page was moved, only the first pageblock's migratetype was changed; the
>    added WARN_ON_ONCE might be triggered by these pages.
> 
> I ran Mike's test with transhuge-stress together with my patches on top of your
> "close migratetype race" patch for more than an hour without any warning.
> It should unblock your patchset. I will keep working on alloc_contig_range()
> simplification.
> 
> 
> --
> Best Regards,
> Yan, Zi

> From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Fri, 22 Sep 2023 11:11:32 -0400
> Subject: [PATCH 1/3] mm: fix __free_one_page().
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/page_alloc.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7de022bc4c7d..72f27d14c8e7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
>  	while (order < MAX_ORDER) {
> -		int buddy_mt;
> -
>  		if (compaction_capture(capc, page, order, migratetype))
>  			return;
>  
> @@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
>  		if (!buddy)
>  			goto done_merging;
>  
> -		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
> -
>  		if (unlikely(order >= pageblock_order)) {
>  			/*
>  			 * We want to prevent merge between freepages on pageblock
> @@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
>  		if (page_is_guard(buddy))
>  			clear_page_guard(zone, buddy, order);
>  		else
> -			del_page_from_free_list(buddy, zone, order, buddy_mt);
> +			del_page_from_free_list(buddy, zone, order, migratetype);
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;

I had a fix for this that's slightly different. The buddy's type can't
be changed while it's still on the freelist, so I moved that
around. The sequence now is:

	int buddy_mt = migratetype;

	if (unlikely(order >= pageblock_order)) {
		/* This is the only case where buddy_mt can differ */
		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
		// compat checks...
	}

	del_page_from_free_list(buddy, zone, order, buddy_mt);

	if (unlikely(buddy_mt != migratetype))
		set_pageblock_migratetype(buddy, migratetype);


> From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Mon, 25 Sep 2023 16:27:14 -0400
> Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
>  free lists.
> 
> This avoids changing the migratetype after move_freepages() or
> move_freepages_block(), which is error prone. It also prepares for upcoming
> changes that fix move_freepages() not moving free pages that are only
> partially covered by the given range.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

This makes the code much cleaner, thank you!

> From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Mon, 25 Sep 2023 16:55:18 -0400
> Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
>  pages.
> 
> alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
> move_freepages(), to isolate free pages. But move_freepages() was not able
> to move free pages partially covered by the specified range, leaving a race
> window open[1]. Fix it by teaching move_freepages() to split a free page
> when only part of it is going to be moved.
> 
> In addition, when a >pageblock_order free page is moved, only its first
> pageblock's migratetype is changed, which can cause warnings later. Fix it
> by setting all pageblocks in a free page to the same migratetype after the
> move.
> 
> split_free_page() is changed to be used in move_freepages() and
> isolate_single_pageblock(). The common code to find the start pfn of a free
> page is factored out into get_freepage_start_pfn().
> 
> [1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
>  mm/page_isolation.c | 17 +++++++---
>  2 files changed, 73 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7c41cb5d8a36..3fd5ab40b55c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
>  	struct zone *zone = page_zone(free_page);
>  	unsigned long free_page_pfn = page_to_pfn(free_page);
>  	unsigned long pfn;
> -	unsigned long flags;
>  	int free_page_order;
>  	int mt;
>  	int ret = 0;
>  
> -	if (split_pfn_offset == 0)
> -		return ret;
> +	/* zone lock should be held when this function is called */
> +	lockdep_assert_held(&zone->lock);
>  
> -	spin_lock_irqsave(&zone->lock, flags);
> +	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
> +		return ret;
>  
>  	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
>  		ret = -ENOENT;
> @@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
>  			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>  	}
>  out:
> -	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>  }
>  /*
> @@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>  					unsigned int order) { return NULL; }
>  #endif
>  
> +/*
> + * Get the first pfn of the free page that contains the given pfn. If no
> + * such free page exists, return the given pfn.
> + */
> +static unsigned long get_freepage_start_pfn(unsigned long pfn)
> +{
> +	int order = 0;
> +	unsigned long start_pfn = pfn;
> +
> +	while (!PageBuddy(pfn_to_page(start_pfn))) {
> +		if (++order > MAX_ORDER) {
> +			start_pfn = pfn;
> +			break;
> +		}
> +		start_pfn &= ~0UL << order;
> +	}
> +	return start_pfn;
> +}
> +
>  /*
>   * Move the free pages in a range to the freelist tail of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
> @@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  			  unsigned long end_pfn, int old_mt, int new_mt)
>  {
>  	struct page *page;
> -	unsigned long pfn;
> +	unsigned long pfn, pfn2;
>  	unsigned int order;
>  	int pages_moved = 0;
> +	unsigned long mt_change_pfn = start_pfn;
> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
> +
> +	/* split at start_pfn if it is in the middle of a free page */
> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
> +		struct page *new_page = pfn_to_page(new_start_pfn);
> +		int new_page_order = buddy_order(new_page);
> +
> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
> +			/* change migratetype so that split_free_page can work */
> +			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +			split_free_page(new_page, buddy_order(new_page),
> +					start_pfn - new_start_pfn);
> +
> +			mt_change_pfn = start_pfn;
> +			/* move to next page */
> +			start_pfn = new_start_pfn + (1 << new_page_order);
> +		}
> +	}

Ok, so if there is a straddle from the previous block into our block
of interest, it's split and the migratetype is set only on our block.

> @@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  
>  		order = buddy_order(page);
>  		move_to_free_list(page, zone, order, old_mt, new_mt);
> +		/*
> +		 * set page migratetype for all pageblocks within the page and
> +		 * only after we move all free pages in one pageblock
> +		 */
> +		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
> +			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
> +			     pfn2 += pageblock_nr_pages) {
> +				set_pageblock_migratetype(pfn_to_page(pfn2),
> +							  new_mt);
> +				mt_change_pfn = pfn2;
> +			}

But if we have the first block of a MAX_ORDER chunk, then we don't
split but rather move the whole chunk and make sure to update the
chunk's blocks that are outside the range of interest.

It looks like either way would work, but why not split here as well
and keep the move contained to the block? Wouldn't this be a bit more
predictable and easier to understand?

> +		}
>  		pfn += 1 << order;
>  		pages_moved += 1 << order;
>  	}
> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +	/* set migratetype for the remaining pageblocks */
> +	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);

I think I'm missing something for this.

- If there was no straddle, there is only our block of interest to
  update.

- If there was a straddle from the previous block, it was split and
  the block of interest was already updated. Nothing to do here?

- If there was a straddle into the next block, both blocks are updated
  to the new type. Nothing to do here?

What's the case where there are multiple blocks to update in the end?

> @@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  			int order = buddy_order(page);
>  
>  			if (pfn + (1UL << order) > boundary_pfn) {
> +				int res;
> +				unsigned long flags;
> +
> +				spin_lock_irqsave(&zone->lock, flags);
> +				res = split_free_page(page, order, boundary_pfn - pfn);
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +
>  				/* free page changed before split, check it again */
> -				if (split_free_page(page, order, boundary_pfn - pfn))
> +				if (res)
>  					continue;

At this point, we've already set the migratetype, which has handled
straddling free pages. Is this split still needed?

> @@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				/*
>  				 * XXX: mark the page as MIGRATE_ISOLATE so that
>  				 * no one else can grab the freed page after migration.
> -				 * Ideally, the page should be freed as two separate
> -				 * pages to be added into separate migratetype free
> -				 * lists.
> +				 * The page should be freed into separate migratetype
> +				 * free lists, unless the free page order is greater
> +				 * than pageblock order. It is not the case now,
> +				 * since gigantic hugetlb is freed as order-0
> +				 * pages and LRU pages do not cross pageblocks.
>  				 */
>  				if (isolate_page) {
>  					ret = set_migratetype_isolate(page, page_mt,

I hadn't thought about LRU pages being constrained to single
pageblocks before. Does this mean we only ever migrate here in case
there is a movable gigantic page? And since those are already split
during the free, does that mean the "reset pfn to head of the free
page" part after the migration is actually unnecessary?


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-21 14:47                                   ` Zi Yan
  2023-09-25 21:12                                     ` Zi Yan
@ 2023-09-26 18:19                                     ` David Hildenbrand
  2023-09-28  3:22                                       ` Zi Yan
  1 sibling, 1 reply; 83+ messages in thread
From: David Hildenbrand @ 2023-09-26 18:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: Johannes Weiner, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On 21.09.23 16:47, Zi Yan wrote:
> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
> 
>> On 21.09.23 04:31, Zi Yan wrote:
>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>
>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>
>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>
>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>
>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>    		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>
>>>>>>>>>>>    		/* Do not cross zone boundaries */
>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>    		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>> 	+#else
>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>> 	+#endif
>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>> 	 		return false;
>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>
>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just to be really clear,
>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>     path WITHOUT your change.
>>>>>>>>>
>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>
>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>
>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>
>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>> script.
>>>>>>>>>>>
>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>
>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>> trying. Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>
>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>
>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>
>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>
>>>>>>>> while true; do
>>>>>>>>    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>    echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>    wait
>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>> done
>>>>>>>>
>>>>>>>> I will look into the issue.
>>>>>>
>>>>>> Nice!
>>>>>>
>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>> several reboots and letting it run for minutes.
>>>>>
>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>> parameters respectively in half.
>>>>>
>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>> printk-tracing, the scenario seems to be this:
>>>>>
>>>>> #0                                                   #1
>>>>> start_isolate_page_range()
>>>>>     isolate_single_pageblock()
>>>>>       set_migratetype_isolate(tail)
>>>>>         lock zone->lock
>>>>>         move_freepages_block(tail) // nop
>>>>>         set_pageblock_migratetype(tail)
>>>>>         unlock zone->lock
>>>>>                                                        del_page_from_freelist(head)
>>>>>                                                        expand(head, head_mt)
>>>>>                                                          WARN(head_mt != tail_mt)
>>>>>       start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>       for (pfn = start_pfn, pfn < end_pfn)
>>>>>         if (PageBuddy())
>>>>>           split_free_page(head)
>>>>>
>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>> type is set and the lock has been dropped, so it's too late.
>>>>
>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>> granularity. With your changes below, it no longer works, because if there
>>>> is an unmovable page in
>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>> the allocation fails but it would succeed in the current implementation.
>>>>
>>>> I think a proper fix would be to make move_freepages_block() split the
>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>
>>>> I am working on that.
>>>
>>> After spending half a day on this, I think it is much harder than I thought
>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>> patchset, because alloc_contig_range() relies on racy migratetype changes:
>>>
>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>> free list yet.
>>>
>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>> the free pages.
>>>
>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>> will need to be split to get their subpages into the right free lists.
>>>
>>> 4. the hardest case is when an in-use page sits across two pageblocks. Currently,
>>> the code just isolates one pageblock, migrates the page, and lets split_free_page()
>>> correct the free list later. But to strictly enforce freelist migratetype
>>> hygiene, extra work is needed at the free page path to split the free page into
>>> the right freelists.
>>>
>>> I need more time to think about how to get alloc_contig_range() working properly.
>>> Help is needed for the bullet point 4.
>>
>>
>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
> 
> I have had that in my backlog since you asked and have been delaying it. ;) Hopefully

It's complicated and I wish I would have had more time to review it
back then ... or now to clean it up later.

Unfortunately, nobody else had the time to review it back then ... maybe we can
do better next time. David doesn't scale.

Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock()
really is sub-optimal (and mostly code duplication from alloc_contig_range).

> I can do it after I fix this. That change might help only if we also make
> some redesign of how migratetype is managed. If MIGRATE_ISOLATE does not
> overwrite the existing migratetype, the code might not need to split a page and
> move it to the MIGRATE_ISOLATE freelist?

Did someone test how memory offlining plays along with that? (I can try myself
within the next 1-2 weeks)

There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
though.

ret = start_isolate_page_range(start_pfn, end_pfn,
			       MIGRATE_MOVABLE,
			       MEMORY_OFFLINE | REPORT_FAILURE,
			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);

> 
> The fundamental issue in alloc_contig_range() is that to work at
> pageblock level, a page (>pageblock_order) can have one part is isolated and
> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
> now checks first pageblock migratetype, so such a page needs to be removed
> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
> finally put back to multiple free lists. This needs to be done at isolation stage
> before free pages are removed from their free lists (the stage after isolation).

One idea was to always isolate larger chunks, and handle movability checks/splits/etc.
at a later stage. Once isolation is decoupled from the actual/original migratetype,
this could be easier to handle (especially some corner cases I had in mind back then).

> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
> in their original migratetype and check migratetype before allocating a page,
> that might help. But that might add extra work (e.g., splitting a partially
> isolated free page before allocation) in the really hot code path, which is not
> desirable.

With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
separate isolate list, but one per "proper migratetype". But again, just some random
thoughts I had back then; I never had sufficient time to think it all through.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
                     ` (2 preceding siblings ...)
  2023-09-14  9:56   ` Mel Gorman
@ 2023-09-27  5:42   ` Huang, Ying
  2023-09-27 14:51     ` Johannes Weiner
  3 siblings, 1 reply; 83+ messages in thread
From: Huang, Ying @ 2023-09-27  5:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

Johannes Weiner <hannes@cmpxchg.org> writes:

> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
>
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), which is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.

I suspected the PCP allocating/freeing path may be influenced (that is,
allocating/freeing batch is less than PCP high).  So I tested
one-process will-it-scale/page_fault1 with sysctl
percpu_pagelist_high_fraction=8.  So pages will be allocated/freed
from/to PCP only.  The test results are as follows,

Before:
will-it-scale.1.processes                        618364.3      (+-  0.075%)
perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)

After:
will-it-scale.1.processes	                 616512.0      (+-  0.057%)
perf-profile.children.get_pfnblock_flags_mask	      0.41     (+-  22.44%)

The change isn't large: -0.3%.  Perf profiling shows the cycles% of
get_pfnblock_flags_mask() increases.

--
Best Regards,
Huang, Ying


* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-27  5:42   ` Huang, Ying
@ 2023-09-27 14:51     ` Johannes Weiner
  2023-09-30  4:26       ` Huang, Ying
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-09-27 14:51 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> 
> > The idea behind the cache is to save get_pageblock_migratetype()
> > lookups during bulk freeing. A microbenchmark suggests this isn't
> > helping, though. The pcp migratetype can get stale, which means that
> > bulk freeing has an extra branch to check if the pageblock was
> > isolated while on the pcp.
> >
> > While the variance overlaps, the cache write and the branch seem to
> > make this a net negative. The following test allocates and frees
> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> >
> > Before:
> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> >
> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> >
> > After:
> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> >
> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> >
> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), which is likely beneficial to long-term
> > fragmentation management, although the effects would be harder to
> > measure. Settle for simpler and faster code as justification here.
> 
> I suspected the PCP allocating/freeing path may be influenced (that is,
> allocating/freeing batch is less than PCP high).  So I tested
> one-process will-it-scale/page_fault1 with sysctl
> percpu_pagelist_high_fraction=8.  So pages will be allocated/freed
> from/to PCP only.  The test results are as follows,
> 
> Before:
> will-it-scale.1.processes                        618364.3      (+-  0.075%)
> perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)
> 
> After:
> will-it-scale.1.processes	                 616512.0      (+-  0.057%)
> perf-profile.children.get_pfnblock_flags_mask	      0.41     (+-  22.44%)
> 
> The change isn't large: -0.3%.  Perf profiling shows the cycles% of
> get_pfnblock_flags_mask() increases.

Ah, this is going through the free_unref_page_list() path that
Vlastimil had pointed out as well. I made another change on top that
eliminates the second lookup. After that, both pcp fast paths have the
same number of lookups as before: 1. This fixes the regression for me.

Would you mind confirming this as well?

--

From f5d032019ed832a1a50454347a33b00ca6abeb30 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Sep 2023 16:03:24 -0400
Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()

Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.

Minor comment and local variable scoping cleanups.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 44 ++++++++++++++++++--------------------------
 1 file changed, 18 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfffc1af94cd..665930ffe22a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2466,48 +2466,40 @@ void free_unref_page_list(struct list_head *list)
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
-	int migratetype;
-
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
-
-		if (!free_pages_prepare(page, 0, FPI_NONE)) {
-			list_del(&page->lru);
-			continue;
-		}
 
-		/*
-		 * Free isolated pages directly to the allocator, see
-		 * comment in free_unref_page.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-		if (unlikely(is_migrate_isolate(migratetype))) {
+	list_for_each_entry_safe(page, next, list, lru)
+		if (!free_pages_prepare(page, 0, FPI_NONE))
 			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, FPI_NONE);
-			continue;
-		}
-	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
+		int migratetype;
 
 		list_del(&page->lru);
 		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * Zone switch, batch complete, or non-pcp freeing?
+		 * Drop the pcp lock and evaluate.
 		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		if (unlikely(zone != locked_zone ||
+			     batch_count == SWAP_CLUSTER_MAX ||
+			     is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
+				locked_zone = NULL;
 			}
 
-			batch_count = 0;
+			/*
+			 * Free isolated pages directly to the
+			 * allocator, see comment in free_unref_page.
+			 */
+			if (is_migrate_isolate(migratetype)) {
+				free_one_page(zone, page, pfn, 0, FPI_NONE);
+				continue;
+			}
 
 			/*
 			 * trylock is necessary as pages may be getting freed
@@ -2518,10 +2510,10 @@ void free_unref_page_list(struct list_head *list)
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
 				free_one_page(zone, page, pfn, 0, FPI_NONE);
-				locked_zone = NULL;
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 
 		/*
-- 
2.42.0



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-26 17:39                                       ` Johannes Weiner
@ 2023-09-28  2:51                                         ` Zi Yan
  2023-10-03  2:26                                           ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-28  2:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


On 26 Sep 2023, at 13:39, Johannes Weiner wrote:

> On Mon, Sep 25, 2023 at 05:12:38PM -0400, Zi Yan wrote:
>> On 21 Sep 2023, at 10:47, Zi Yan wrote:
>>
>>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>>
>>>> On 21.09.23 04:31, Zi Yan wrote:
>>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>>
>>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>>
>>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>>
>>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>>
>>>>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>>> 	+#else
>>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>>
>>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Just to be really clear,
>>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>>    path WITHOUT your change.
>>>>>>>>>>>
>>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>>
>>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>>
>>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>>
>>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>>> script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>>
>>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>>
>>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>>
>>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>>
>>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>>
>>>>>>>>>> while true; do
>>>>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>>   wait
>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> I will look into the issue.
>>>>>>>>
>>>>>>>> Nice!
>>>>>>>>
>>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>>> several reboots and letting it run for minutes.
>>>>>>>
>>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>>> parameters respectively in half.
>>>>>>>
>>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>>
>>>>>>> #0                                                   #1
>>>>>>> start_isolate_page_range()
>>>>>>>    isolate_single_pageblock()
>>>>>>>      set_migratetype_isolate(tail)
>>>>>>>        lock zone->lock
>>>>>>>        move_freepages_block(tail) // nop
>>>>>>>        set_pageblock_migratetype(tail)
>>>>>>>        unlock zone->lock
>>>>>>>                                                       del_page_from_freelist(head)
>>>>>>>                                                       expand(head, head_mt)
>>>>>>>                                                         WARN(head_mt != tail_mt)
>>>>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>>>>        if (PageBuddy())
>>>>>>>          split_free_page(head)
>>>>>>>
>>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>>
>>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>>> granularity. With your changes below, it no longer works, because if there
>>>>>> is an unmovable page in
>>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>>> the allocation fails but it would succeed in current implementation.
>>>>>>
>>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>>
>>>>>> I am working on that.
>>>>>
>>>>> After spending half a day on this, I think it is much harder than I thought
>>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>>
>>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>>> free list yet.
>>>>>
>>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>>> the free pages.
>>>>>
>>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>>> in-use pages. But it is not the case when alloc_contig_range() works on
>>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>>> will need to be split to get their subpages into the right free lists.
>>>>>
>>>>> 4. the hardest case is when an in-use page sits across two pageblocks. Currently,
>>>>> the code just isolates one pageblock, migrates the page, and lets split_free_page()
>>>>> correct the free list later. But to strictly enforce freelist migratetype
>>>>> hygiene, extra work is needed at free page path to split the free page into
>>>>> the right freelists.
>>>>>
>>>>> I need more time to think about how to get alloc_contig_range() working properly.
>>>>> Help is needed for bullet point 4.
>>>>
>>>>
>>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>>
>>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>>> I can do it after I fix this. That change might or might not help only if we make
>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>> overwrite existing migratetype, the code might not need to split a page and move
>>> it to MIGRATE_ISOLATE freelist?
>>>
>>> The fundamental issue in alloc_contig_range() is that to work at
>>> pageblock level, a page (>pageblock_order) can have one part isolated and
>>> the rest in a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>> now check the first pageblock migratetype, so such a page needs to be removed
>>> from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split, and
>>> finally put back to multiple free lists. This needs to be done at isolation stage
>>> before free pages are removed from their free lists (the stage after isolation).
>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>> in their original migratetype and check migratetype before allocating a page,
>>> that might help. But that might add extra work (e.g., splitting a partially
>>> isolated free page before allocation) in the really hot code path, which is not
>>> desirable.
>>>
>>>>
>>>> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
>>>
>>> I will try my best to simplify it.
>>
>> Hi Johannes,
>>
>> I attached three patches to fix the issue and first two can be folded into
>> your patchset:
>
> Hi Zi, thanks for providing these patches! I'll pick them up into the
> series.
>
>> 1. __free_one_page() bug you and Vlastimil discussed on the other email.
>> 2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
>> 3. enable move_freepages() to split a free page that is partially covered by
>>    [start_pfn, end_pfn] in the parameter and set migratetype correctly when
>>    a >pageblock_order free page is moved. Before when a >pageblock_order
>>    free page is moved, only first pageblock migratetype is changed. The added
>>    WARN_ON_ONCE might be triggered by these pages.
>>
>> I ran Mike's test with transhuge-stress together with my patches on top of your
>> "close migratetype race" patch for more than an hour without any warning.
>> It should unblock your patchset. I will keep working on alloc_contig_range()
>> simplification.
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
>
>> From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Fri, 22 Sep 2023 11:11:32 -0400
>> Subject: [PATCH 1/3] mm: fix __free_one_page().
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/page_alloc.c | 6 +-----
>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7de022bc4c7d..72f27d14c8e7 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
>>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>>
>>  	while (order < MAX_ORDER) {
>> -		int buddy_mt;
>> -
>>  		if (compaction_capture(capc, page, order, migratetype))
>>  			return;
>>
>> @@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
>>  		if (!buddy)
>>  			goto done_merging;
>>
>> -		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>> -
>>  		if (unlikely(order >= pageblock_order)) {
>>  			/*
>>  			 * We want to prevent merge between freepages on pageblock
>> @@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
>>  		if (page_is_guard(buddy))
>>  			clear_page_guard(zone, buddy, order);
>>  		else
>> -			del_page_from_free_list(buddy, zone, order, buddy_mt);
>> +			del_page_from_free_list(buddy, zone, order, migratetype);
>>  		combined_pfn = buddy_pfn & pfn;
>>  		page = page + (combined_pfn - pfn);
>>  		pfn = combined_pfn;
>
> I had a fix for this that's slightly different. The buddy's type can't
> be changed while it's still on the freelist, so I moved that
> around. The sequence now is:
>
> 	int buddy_mt = migratetype;
>
> 	if (unlikely(order >= pageblock_order)) {
> 		/* This is the only case where buddy_mt can differ */
> 		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
> 		// compat checks...
> 	}
>
> 	del_page_from_free_list(buddy, buddy_mt);
>
> 	if (unlikely(buddy_mt != migratetype))
> 		set_pageblock_migratetype(buddy, migratetype);
>
>
>> From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>> Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
>>  free lists.
>>
>> This avoids changing migratetype after move_freepages() or
>> move_freepages_block(), which is error prone. It also prepares for upcoming
>> changes to fix move_freepages() not moving free pages that are partially
>> covered by the range.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> This makes the code much cleaner, thank you!
>
>> From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Mon, 25 Sep 2023 16:55:18 -0400
>> Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
>>  pages.
>>
>> alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
>> move_freepages(), to isolate free pages. But move_freepages() was not able
>> to move free pages partially covered by the specified range, leaving a race
>> window open[1]. Fix it by teaching move_freepages() to split a free page
>> when only part of it is going to be moved.
>>
>> In addition, when a >pageblock_order free page is moved, only its first
>> pageblock migratetype is changed. It can cause warnings later. Fix it by
>> setting all pageblocks in a free page to the same migratetype after the move.
>>
>> split_free_page() is changed to be used in move_freepages() and
>> isolate_single_pageblock(). Common code to find the start pfn of a free
>> page is factored out into get_freepage_start_pfn().
>>
>> [1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
>>  mm/page_isolation.c | 17 +++++++---
>>  2 files changed, 73 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7c41cb5d8a36..3fd5ab40b55c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
>>  	struct zone *zone = page_zone(free_page);
>>  	unsigned long free_page_pfn = page_to_pfn(free_page);
>>  	unsigned long pfn;
>> -	unsigned long flags;
>>  	int free_page_order;
>>  	int mt;
>>  	int ret = 0;
>>
>> -	if (split_pfn_offset == 0)
>> -		return ret;
>> +	/* zone lock should be held when this function is called */
>> +	lockdep_assert_held(&zone->lock);
>>
>> -	spin_lock_irqsave(&zone->lock, flags);
>> +	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
>> +		return ret;
>>
>>  	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
>>  		ret = -ENOENT;
>> @@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
>>  			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>>  	}
>>  out:
>> -	spin_unlock_irqrestore(&zone->lock, flags);
>>  	return ret;
>>  }
>>  /*
>> @@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>>  					unsigned int order) { return NULL; }
>>  #endif
>>
>> +/*
>> + * Get the first pfn of the free page that contains pfn. If no such free
>> + * page exists, return the given pfn.
>> + */
>> +static unsigned long get_freepage_start_pfn(unsigned long pfn)
>> +{
>> +	int order = 0;
>> +	unsigned long start_pfn = pfn;
>> +
>> +	while (!PageBuddy(pfn_to_page(start_pfn))) {
>> +		if (++order > MAX_ORDER) {
>> +			start_pfn = pfn;
>> +			break;
>> +		}
>> +		start_pfn &= ~0UL << order;
>> +	}
>> +	return start_pfn;
>> +}
>> +
>>  /*
>>   * Move the free pages in a range to the freelist tail of the requested type.
>>   * Note that start_page and end_pages are not aligned on a pageblock
>> @@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>  {
>>  	struct page *page;
>> -	unsigned long pfn;
>> +	unsigned long pfn, pfn2;
>>  	unsigned int order;
>>  	int pages_moved = 0;
>> +	unsigned long mt_change_pfn = start_pfn;
>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>> +
>> +	/* split at start_pfn if it is in the middle of a free page */
>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>> +		int new_page_order = buddy_order(new_page);
>> +
>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>> +			/* change migratetype so that split_free_page can work */
>> +			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>> +			split_free_page(new_page, buddy_order(new_page),
>> +					start_pfn - new_start_pfn);
>> +
>> +			mt_change_pfn = start_pfn;
>> +			/* move to next page */
>> +			start_pfn = new_start_pfn + (1 << new_page_order);
>> +		}
>> +	}
>
> Ok, so if there is a straddle from the previous block into our block
> of interest, it's split and the migratetype is set only on our block.

Correct. For example, start_pfn is 0x200 (2MB) and the free page starting from 0x0 is order-10 (4MB).

>
>> @@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>
>>  		order = buddy_order(page);
>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>> +		/*
>> +		 * set page migratetype for all pageblocks within the page and
>> +		 * only after we move all free pages in one pageblock
>> +		 */
>> +		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
>> +			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
>> +			     pfn2 += pageblock_nr_pages) {
>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>> +							  new_mt);
>> +				mt_change_pfn = pfn2;
>> +			}
>
> But if we have the first block of a MAX_ORDER chunk, then we don't
> split but rather move the whole chunk and make sure to update the
> chunk's blocks that are outside the range of interest.
>
> It looks like either way would work, but why not split here as well
> and keep the move contained to the block? Wouldn't this be a bit more
> predictable and easier to understand?

Yes, having a split here would be consistent.

Also, I want to spell out the corner case I am handling here (and I will add
it to the comment): move_to_free_list() checks the page's migratetype against
old_mt, and changing one page's migratetype affects all pages within the same
pageblock. So if we are moving more than one free page in the same pageblock,
setting the migratetype right after move_to_free_list() triggers the warning.

>> +		}
>>  		pfn += 1 << order;
>>  		pages_moved += 1 << order;
>>  	}
>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>> +	/* set migratetype for the remaining pageblocks */
>> +	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>
> I think I'm missing something for this.
>
> - If there was no straddle, there is only our block of interest to
>   update.
>
> - If there was a straddle from the previous block, it was split and
>   the block of interest was already updated. Nothing to do here?
>
> - If there was a straddle into the next block, both blocks are updated
>   to the new type. Nothing to do here?
>
> What's the case where there are multiple blocks to update in the end?

When a pageblock has free pages at the beginning and in-use pages at the end,
the pageblock migratetype is not changed in the for loop above, since the
free pages do not cross the pageblock boundary. But these free pages are
moved to the new_mt free list and will trigger warnings later.

Also, if multiple pageblocks contain only in-use pages, the for loop does
nothing for them either; their migratetypes are set at this point instead.
I notice as I am writing this that it might be a change of behavior, but the
change might be for the better. Before, such a pageblock's migratetype might
or might not be changed, depending on whether there was a free page in the
same pageblock, meaning there could be migratetype holes in the specified
range. Now the whole range is changed to new_mt. Let me know if you have a
different opinion.


>> @@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>  			int order = buddy_order(page);
>>
>>  			if (pfn + (1UL << order) > boundary_pfn) {
>> +				int res;
>> +				unsigned long flags;
>> +
>> +				spin_lock_irqsave(&zone->lock, flags);
>> +				res = split_free_page(page, order, boundary_pfn - pfn);
>> +				spin_unlock_irqrestore(&zone->lock, flags);
>> +
>>  				/* free page changed before split, check it again */
>> -				if (split_free_page(page, order, boundary_pfn - pfn))
>> +				if (res)
>>  					continue;
>
> At this point, we've already set the migratetype, which has handled
> straddling free pages. Is this split still needed?

Good point. I will remove it. Originally, I thought it should stay to handle
free pages coming from the migration below. But unless an in-use page larger
than pageblock order shows up in the system and is freed directly via
__free_pages(), any free page coming from the migration below should already
be put on the right free list.

Such > pageblock order pages are only possible if we have > PMD order THPs
or __PageMovable pages. IIRC, neither exists yet.

>
>> @@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>  				/*
>>  				 * XXX: mark the page as MIGRATE_ISOLATE so that
>>  				 * no one else can grab the freed page after migration.
>> -				 * Ideally, the page should be freed as two separate
>> -				 * pages to be added into separate migratetype free
>> -				 * lists.
>> +				 * The page should be freed into separate migratetype
>> +				 * free lists, unless the free page order is greater
>> +				 * than pageblock order. It is not the case now,
>> +				 * since gigantic hugetlb is freed as order-0
>> +				 * pages and LRU pages do not cross pageblocks.
>>  				 */
>>  				if (isolate_page) {
>>  					ret = set_migratetype_isolate(page, page_mt,
>
> I hadn't thought about LRU pages being constrained to single
> pageblocks before. Does this mean we only ever migrate here in case

Initially, I thought a lot about what happens if a high order folio crosses
two adjacent pageblocks, but in the end I found that __find_buddy_pfn()
does not treat pfns from adjacent pageblocks as buddies unless the order
is at least pageblock order. So no high order folio coming from the
buddy allocator crosses pageblocks. That is a relief.

Another (future) possibility is that once anon large folios and my
split-huge-page-to-any-lower-order patches are merged, a high order folio
might come not directly from the buddy allocator but from a huge page split.
But that requires a > pageblock order folio to exist first, which is not
possible either. So we are good.

> there is a movable gigantic page? And since those are already split
> during the free, does that mean the "reset pfn to head of the free
> page" part after the migration is actually unnecessary?

Yes, the "reset pfn" code could be removed.

Thank you for the review. Really appreciate it. Let me revise my
patch 3 and send it out again.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-26 18:19                                     ` David Hildenbrand
@ 2023-09-28  3:22                                       ` Zi Yan
  2023-10-02 11:43                                         ` David Hildenbrand
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-09-28  3:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 11866 bytes --]

On 26 Sep 2023, at 14:19, David Hildenbrand wrote:

> On 21.09.23 16:47, Zi Yan wrote:
>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>
>>> On 21.09.23 04:31, Zi Yan wrote:
>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>
>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>
>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>
>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>    		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>
>>>>>>>>>>>>    		/* Do not cross zone boundaries */
>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>    		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>> 	+#else
>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>
>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Just to be really clear,
>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>     path WITHOUT your change.
>>>>>>>>>>
>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>
>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>
>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>
>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>> script.
>>>>>>>>>>>>
>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>
>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>
>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>
>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>
>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>
>>>>>>>>> while true; do
>>>>>>>>>    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>    echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>    wait
>>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> I will look into the issue.
>>>>>>>
>>>>>>> Nice!
>>>>>>>
>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>> several reboots and letting it run for minutes.
>>>>>>
>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>> parameters respectively in half.
>>>>>>
>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>
>>>>>> #0                                                   #1
>>>>>> start_isolate_page_range()
>>>>>>     isolate_single_pageblock()
>>>>>>       set_migratetype_isolate(tail)
>>>>>>         lock zone->lock
>>>>>>         move_freepages_block(tail) // nop
>>>>>>         set_pageblock_migratetype(tail)
>>>>>>         unlock zone->lock
>>>>>>                                                        del_page_from_freelist(head)
>>>>>>                                                        expand(head, head_mt)
>>>>>>                                                          WARN(head_mt != tail_mt)
>>>>>>       start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>       for (pfn = start_pfn, pfn < end_pfn)
>>>>>>         if (PageBuddy())
>>>>>>           split_free_page(head)
>>>>>>
>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>
>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>> granularity. With your changes below, it no longer works, because if there
>>>>> is an unmovable page in
>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>> the allocation fails but it would succeed in current implementation.
>>>>>
>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>
>>>>> I am working on that.
>>>>
>>>> After spending half a day on this, I think it is much harder than I thought
>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>
>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>> free list yet.
>>>>
>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>> the free pages.
>>>>
>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>> in-use pages. But it is not the case when alloc_contig_range() works on
>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>> will need to be split to get their subpages into the right free lists.
>>>>
>>>> 4. the hardest case is when an in-use page sits across two pageblocks. Currently,
>>>> the code just isolates one pageblock, migrates the page, and lets split_free_page()
>>>> correct the free list later. But to strictly enforce freelist migratetype
>>>> hygiene, extra work is needed at free page path to split the free page into
>>>> the right freelists.
>>>>
>>>> I need more time to think about how to get alloc_contig_range() working properly.
>>>> Help is needed for bullet point 4.
>>>
>>>
>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>
>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>
> It's complicated and I wish I would have had more time to review it
> back then ... or now to clean it up later.
>
> Unfortunately, nobody else did have the time to review it back then ... maybe we can
> do better next time. David doesn't scale.
>
> Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock()
> really is sub-optimal (and mostly code duplication from alloc_contig_range).

I felt the same when I wrote the code. But I thought it was the only way out.

>
>> I can do it after I fix this. That change might or might not help only if we make
>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>> overwrite existing migratetype, the code might not need to split a page and move
>> it to MIGRATE_ISOLATE freelist?
>
> Did someone test how memory offlining plays along with that? (I can try myself
> within the next 1-2 weeks)
>
> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
> though.
>
> ret = start_isolate_page_range(start_pfn, end_pfn,
> 			       MIGRATE_MOVABLE,
> 			       MEMORY_OFFLINE | REPORT_FAILURE,
> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);

Since a full MAX_ORDER range is passed, no free page split will happen.

>
>>
>> The fundamental issue in alloc_contig_range() is that to work at
>> pageblock level, a page (>pageblock_order) can have one part isolated while
>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>> now checks the first pageblock migratetype, so such a page needs to be removed
>> from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split, and
>> finally put back onto multiple free lists. This needs to be done at the isolation stage,
>> before free pages are removed from their free lists (the stage after isolation).
>
> One idea was to always isolate larger chunks, and handle movability checks/split/etc
> at a later stage. Once isolation is decoupled from the actual/original migratetype,
> things could be easier to handle (especially some corner cases I had in mind back then).

I think it is a good idea. When I coded alloc_contig_range() up, I tried to
accommodate the existing set_migratetype_isolate(), which calls has_unmovable_pages().
If these two are decoupled, set_migratetype_isolate() can work on MAX_ORDER-aligned
ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
Let me give this a try.

>
>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>> in their original migratetype and checking the migratetype before allocating a page,
>> that might help. But that might add extra work (e.g., splitting a partially
>> isolated free page before allocation) in a really hot code path, which is not
>> desirable.
>
> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
> separate isolate list, but one per "proper migratetype". But again, just some random
> thoughts I had back then, I never had sufficient time to think it all through.

Got it. I will think about it.

One question on separate MIGRATE_ISOLATE:

the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
bit instead of being one of the migratetypes. But right now there are 5 migratetypes +
MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype bit is needed.
But the current migratetype implementation is word-based, requiring
NR_PAGEBLOCK_BITS to be a divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.
An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
additional changes. Let me know if you have a better idea. Thanks.



--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-27 14:51     ` Johannes Weiner
@ 2023-09-30  4:26       ` Huang, Ying
  2023-10-02 14:58         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Huang, Ying @ 2023-09-30  4:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

Johannes Weiner <hannes@cmpxchg.org> writes:

> On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> 
>> > The idea behind the cache is to save get_pageblock_migratetype()
>> > lookups during bulk freeing. A microbenchmark suggests this isn't
>> > helping, though. The pcp migratetype can get stale, which means that
>> > bulk freeing has an extra branch to check if the pageblock was
>> > isolated while on the pcp.
>> >
>> > While the variance overlaps, the cache write and the branch seem to
>> > make this a net negative. The following test allocates and frees
>> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>> >
>> > Before:
>> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>> >                  0      cpu-migrations                   #    0.000 /sec
>> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>> >
>> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>> >
>> > After:
>> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>> >                  0      cpu-migrations                   #    0.000 /sec
>> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>> >
>> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>> >
>> > A side effect is that this also ensures that pages whose pageblock
>> > gets stolen while on the pcplist end up on the right freelist and we
>> > don't perform potentially type-incompatible buddy merges (or skip
>> > merges when we shouldn't), which is likely beneficial to long-term
>> > fragmentation management, although the effects would be harder to
>> > measure. Settle for simpler and faster code as justification here.
>> 
>> I suspected the PCP allocating/freeing path may be influenced (that is,
>> when the allocating/freeing batch is less than the PCP high mark).  So I tested
>> one-process will-it-scale/page_fault1 with sysctl
>> percpu_pagelist_high_fraction=8, so that pages are allocated/freed
>> from/to the PCP only.  The test results are as follows,
>> 
>> Before:
>> will-it-scale.1.processes                        618364.3      (+-  0.075%)
>> perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)
>> 
>> After:
>> will-it-scale.1.processes	                 616512.0      (+-  0.057%)
>> perf-profile.children.get_pfnblock_flags_mask	      0.41     (+-  22.44%)
>> 
>> The change isn't large: -0.3%.  Perf profiling shows the cycles% of
>> get_pfnblock_flags_mask() increases.
>
> Ah, this is going through the free_unref_page_list() path that
> Vlastimil had pointed out as well. I made another change on top that
> eliminates the second lookup. After that, both pcp fast paths have the
> same number of lookups as before: 1. This fixes the regression for me.
>
> Would you mind confirming this as well?

I have done more tests for the series and addon patches.  The test
results are as follows,

base
perf-profile.children.get_pfnblock_flags_mask	     0.15	(+- 32.62%)
will-it-scale.1.processes			618621.7	(+-  0.18%)

mm: page_alloc: remove pcppage migratetype caching
perf-profile.children.get_pfnblock_flags_mask	     0.40	(+- 21.55%)
will-it-scale.1.processes			616350.3	(+-  0.27%)

mm: page_alloc: fix up block types when merging compatible blocks
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+-  8.36%)
will-it-scale.1.processes			617121.0	(+-  0.17%)

mm: page_alloc: move free pages when converting block during isolation
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 15.10%)
will-it-scale.1.processes			615578.0	(+-  0.18%)

mm: page_alloc: fix move_freepages_block() range error
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 12.78%)
will-it-scale.1.processes			615364.7	(+-  0.27%)

mm: page_alloc: fix freelist movement during block conversion
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 10.52%)
will-it-scale.1.processes			617834.8	(+-  0.52%)

mm: page_alloc: consolidate free page accounting
perf-profile.children.get_pfnblock_flags_mask	     0.39	(+-  8.27%)
will-it-scale.1.processes			621000.0	(+-  0.13%)

mm: page_alloc: close migratetype race between freeing and stealing
perf-profile.children.get_pfnblock_flags_mask	     0.37	(+-  5.87%)
will-it-scale.1.processes			618378.8	(+-  0.17%)

mm: page_alloc: optimize free_unref_page_list()
perf-profile.children.get_pfnblock_flags_mask	     0.20	(+- 14.96%)
will-it-scale.1.processes			618136.3	(+-  0.16%)

It seems that the will-it-scale score is influenced by some other
factors too.  But anyway, the series + addon patches restores the score
of will-it-scale.  And the cycles% of get_pfnblock_flags_mask() is
almost restored by the final patch (mm: page_alloc: optimize
free_unref_page_list()).

Feel free to add my "Tested-by" for these patches.

--
Best Regards,
Huang, Ying


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-28  3:22                                       ` Zi Yan
@ 2023-10-02 11:43                                         ` David Hildenbrand
  2023-10-03  2:35                                           ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: David Hildenbrand @ 2023-10-02 11:43 UTC (permalink / raw)
  To: Zi Yan
  Cc: Johannes Weiner, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

>>> I can do it after I fix this. That change might or might not help only if we make
>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>> overwrite existing migratetype, the code might not need to split a page and move
>>> it to MIGRATE_ISOLATE freelist?
>>
>> Did someone test how memory offlining plays along with that? (I can try myself
>> within the next 1-2 weeks)
>>
>> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
>> though.
>>
>> ret = start_isolate_page_range(start_pfn, end_pfn,
>> 			       MIGRATE_MOVABLE,
>> 			       MEMORY_OFFLINE | REPORT_FAILURE,
>> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
> 
> Since a full MAX_ORDER range is passed, no free page split will happen.

Okay, thanks for verifying that it should not be affected!

> 
>>
>>>
>>> The fundamental issue in alloc_contig_range() is that to work at
>>> pageblock level, a page (>pageblock_order) can have one part isolated while
>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>> now checks the first pageblock migratetype, so such a page needs to be removed
>>> from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split, and
>>> finally put back onto multiple free lists. This needs to be done at the isolation stage,
>>> before free pages are removed from their free lists (the stage after isolation).
>>
>> One idea was to always isolate larger chunks, and handle movability checks/split/etc
>> at a later stage. Once isolation is decoupled from the actual/original migratetype,
>> things could be easier to handle (especially some corner cases I had in mind back then).
> 
> I think it is a good idea. When I coded alloc_contig_range() up, I tried to
> accommodate the existing set_migratetype_isolate(), which calls has_unmovable_pages().
> If these two are decoupled, set_migratetype_isolate() can work on MAX_ORDER-aligned
> ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
> Let me give this a try.
> 

But again, just some thought I had back then; maybe it doesn't help for
anything. I never found more time to look into the whole thing in more detail.

>>
>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>> in their original migratetype and checking the migratetype before allocating a page,
>>> that might help. But that might add extra work (e.g., splitting a partially
>>> isolated free page before allocation) in a really hot code path, which is not
>>> desirable.
>>
>> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
>> separate isolate list, but one per "proper migratetype". But again, just some random
>> thoughts I had back then, I never had sufficient time to think it all through.
> 
> Got it. I will think about it.
> 
> One question on separate MIGRATE_ISOLATE:
> 
> the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
> bit instead of being one of the migratetypes. But right now there are 5 migratetypes +

Exactly what I was concerned about back then ...

> MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype bit is needed.
> But the current migratetype implementation is word-based, requiring
> NR_PAGEBLOCK_BITS to be a divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
> needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.

... until I did the math. Let's assume a pageblock is 2 MiB.

4/(2 * 1024 * 1024 * 8) = 0.00002384185791016 %

8/(2 * 1024 * 1024 * 8) -> 1/(2 * 1024 * 1024) = 0.00004768371582031 %

For a 1 TiB machine that means 256 KiB vs. 512 KiB

I concluded that "wasting a lot of space" is not really the right word 
to describe that :)

Just to put it into perspective, the memmap (64/4096) for a 1 TiB 
machine is ... 16 GiB.

> An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
> additional changes. Let me know if you have a better idea. Thanks.

It would probably be cleanest to just use one byte per pageblock. That 
would clean up the whole machinery eventually as well.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
  2023-09-30  4:26       ` Huang, Ying
@ 2023-10-02 14:58         ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-10-02 14:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Miaohe Lin,
	Kefeng Wang, Zi Yan, linux-mm, linux-kernel

On Sat, Sep 30, 2023 at 12:26:01PM +0800, Huang, Ying wrote:
> I have done more tests for the series and addon patches.  The test
> results are as follows,
> 
> base
> perf-profile.children.get_pfnblock_flags_mask	     0.15	(+- 32.62%)
> will-it-scale.1.processes			618621.7	(+-  0.18%)
> 
> mm: page_alloc: remove pcppage migratetype caching
> perf-profile.children.get_pfnblock_flags_mask	     0.40	(+- 21.55%)
> will-it-scale.1.processes			616350.3	(+-  0.27%)
> 
> mm: page_alloc: fix up block types when merging compatible blocks
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+-  8.36%)
> will-it-scale.1.processes			617121.0	(+-  0.17%)
> 
> mm: page_alloc: move free pages when converting block during isolation
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 15.10%)
> will-it-scale.1.processes			615578.0	(+-  0.18%)
> 
> mm: page_alloc: fix move_freepages_block() range error
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 12.78%)
> will-it-scale.1.processes			615364.7	(+-  0.27%)
> 
> mm: page_alloc: fix freelist movement during block conversion
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 10.52%)
> will-it-scale.1.processes			617834.8	(+-  0.52%)
> 
> mm: page_alloc: consolidate free page accounting
> perf-profile.children.get_pfnblock_flags_mask	     0.39	(+-  8.27%)
> will-it-scale.1.processes			621000.0	(+-  0.13%)
> 
> mm: page_alloc: close migratetype race between freeing and stealing
> perf-profile.children.get_pfnblock_flags_mask	     0.37	(+-  5.87%)
> will-it-scale.1.processes			618378.8	(+-  0.17%)
> 
> mm: page_alloc: optimize free_unref_page_list()
> perf-profile.children.get_pfnblock_flags_mask	     0.20	(+- 14.96%)
> will-it-scale.1.processes			618136.3	(+-  0.16%)
> 
> It seems that the will-it-scale score is influenced by some other
> factors too.  But anyway, the series + addon patches restores the score
> of will-it-scale.  And the cycles% of get_pfnblock_flags_mask() is
> almost restored by the final patch (mm: page_alloc: optimize
> free_unref_page_list()).
> 
> Feel free to add my "Tested-by" for these patches.

Thanks, I'll add those!



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-09-28  2:51                                         ` Zi Yan
@ 2023-10-03  2:26                                           ` Zi Yan
  2023-10-10 21:12                                             ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-10-03  2:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel



On 27 Sep 2023, at 22:51, Zi Yan wrote:

> On 26 Sep 2023, at 13:39, Johannes Weiner wrote:
>
>> On Mon, Sep 25, 2023 at 05:12:38PM -0400, Zi Yan wrote:
>>> On 21 Sep 2023, at 10:47, Zi Yan wrote:
>>>
>>>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>>>
>>>>> On 21.09.23 04:31, Zi Yan wrote:
>>>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>>>
>>>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>>>
>>>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>>>> 	+#else
>>>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Just to be really clear,
>>>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>>>    path WITHOUT your change.
>>>>>>>>>>>>
>>>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>>>
>>>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>>>
>>>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>>>
>>>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>>>> script.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>>>
>>>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>>>
>>>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>>>
>>>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>>>
>>>>>>>>>>> while true; do
>>>>>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>>>   wait
>>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>>> I will look into the issue.
>>>>>>>>>
>>>>>>>>> Nice!
>>>>>>>>>
>>>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>>>> several reboots and letting it run for minutes.
>>>>>>>>
>>>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>>>> parameters respectively in half.
>>>>>>>>
>>>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>>>
>>>>>>>> #0                                                   #1
>>>>>>>> start_isolate_page_range()
>>>>>>>>    isolate_single_pageblock()
>>>>>>>>      set_migratetype_isolate(tail)
>>>>>>>>        lock zone->lock
>>>>>>>>        move_freepages_block(tail) // nop
>>>>>>>>        set_pageblock_migratetype(tail)
>>>>>>>>        unlock zone->lock
>>>>>>>>                                                       del_page_from_freelist(head)
>>>>>>>>                                                       expand(head, head_mt)
>>>>>>>>                                                         WARN(head_mt != tail_mt)
>>>>>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>>>>>        if (PageBuddy())
>>>>>>>>          split_free_page(head)
>>>>>>>>
>>>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>>>
>>>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>>>> granularity. With your changes below, it no longer works, because if there
>>>>>>> is an unmovable page in
>>>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>>>> the allocation fails but it would succeed in the current implementation.
>>>>>>>
>>>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>>>
>>>>>>> I am working on that.
>>>>>>
>>>>>> After spending half a day on this, I think it is much harder than I thought
>>>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>>>> patchset, because alloc_contig_range() relies on racy migratetype changes:
>>>>>>
>>>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>>>> free list yet.
>>>>>>
>>>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>>>> the free pages.
>>>>>>
>>>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>>>> in-use pages. But it is not the case when alloc_contig_range() works on
>>>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>>>> will need to be split to get their subpages into the right free lists.
>>>>>>
>>>>>> 4. the hardest case is when an in-use page sits across two pageblocks. Currently,
>>>>>> the code just isolates one pageblock, migrates the page, and lets split_free_page()
>>>>>> correct the free list later. But to strictly enforce freelist migratetype
>>>>>> hygiene, extra work is needed in the free page path to split the free page onto
>>>>>> the right freelists.
>>>>>>
>>>>>> I need more time to think about how to get alloc_contig_range() working properly.
>>>>>> Help is needed for the bullet point 4.
>>>>>
>>>>>
>>>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>>>
>>>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>>>> I can do it after I fix this. That change might or might not help only if we make
>>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>>> overwrite existing migratetype, the code might not need to split a page and move
>>>> it to MIGRATE_ISOLATE freelist?
>>>>
>>>> The fundamental issue in alloc_contig_range() is that to work at
>>>> pageblock level, a page (>pageblock_order) can have one part isolated while
>>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>>> now checks the first pageblock migratetype, so such a page needs to be removed
>>>> from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split, and
>>>> finally put back onto multiple free lists. This needs to be done at the isolation stage,
>>>> before free pages are removed from their free lists (the stage after isolation).
>>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>>> in their original migratetype and checking the migratetype before allocating a page,
>>>> that might help. But that might add extra work (e.g., splitting a partially
>>>> isolated free page before allocation) in a really hot code path, which is not
>>>> desirable.
>>>>
>>>>>
>>>>> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
>>>>
>>>> I will try my best to simplify it.
>>>
>>> Hi Johannes,
>>>
>>> I attached three patches to fix the issue and first two can be folded into
>>> your patchset:
>>
>> Hi Zi, thanks for providing these patches! I'll pick them up into the
>> series.
>>
>>> 1. __free_one_page() bug you and Vlastimil discussed on the other email.
>>> 2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
>>> 3. enable move_freepages() to split a free page that is partially covered by
>>>    [start_pfn, end_pfn] in the parameter and set migratetype correctly when
>>>    a >pageblock_order free page is moved. Before, when a >pageblock_order
>>>    free page is moved, only the first pageblock's migratetype is changed. The added
>>>    WARN_ON_ONCE might be triggered by these pages.
>>>
>>> I ran Mike's test with transhuge-stress together with my patches on top of your
>>> "close migratetype race" patch for more than an hour without any warning.
>>> It should unblock your patchset. I will keep working on alloc_contig_range()
>>> simplification.
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>>> From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Fri, 22 Sep 2023 11:11:32 -0400
>>> Subject: [PATCH 1/3] mm: fix __free_one_page().
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/page_alloc.c | 6 +-----
>>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 7de022bc4c7d..72f27d14c8e7 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
>>>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>>>
>>>  	while (order < MAX_ORDER) {
>>> -		int buddy_mt;
>>> -
>>>  		if (compaction_capture(capc, page, order, migratetype))
>>>  			return;
>>>
>>> @@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
>>>  		if (!buddy)
>>>  			goto done_merging;
>>>
>>> -		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>>> -
>>>  		if (unlikely(order >= pageblock_order)) {
>>>  			/*
>>>  			 * We want to prevent merge between freepages on pageblock
>>> @@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
>>>  		if (page_is_guard(buddy))
>>>  			clear_page_guard(zone, buddy, order);
>>>  		else
>>> -			del_page_from_free_list(buddy, zone, order, buddy_mt);
>>> +			del_page_from_free_list(buddy, zone, order, migratetype);
>>>  		combined_pfn = buddy_pfn & pfn;
>>>  		page = page + (combined_pfn - pfn);
>>>  		pfn = combined_pfn;
>>
>> I had a fix for this that's slightly different. The buddy's type can't
>> be changed while it's still on the freelist, so I moved that
>> around. The sequence now is:
>>
>> 	int buddy_mt = migratetype;
>>
>> 	if (unlikely(order >= pageblock_order)) {
>> 		/* This is the only case where buddy_mt can differ */
>> 		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>> 		// compat checks...
>> 	}
>>
>> 	del_page_from_free_list(buddy, buddy_mt);
>>
>> 	if (unlikely(buddy_mt != migratetype))
>> 		set_pageblock_migratetype(buddy, migratetype);
>>
>>
>>> From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>>> Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
>>>  free lists.
>>>
>>> This avoids changing migratetype after move_freepages() or
>>> move_freepages_block(), which is error prone. It also prepares for upcoming
>>> changes to fix move_freepages() not moving free pages partially in the
>>> range.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>
>> This makes the code much cleaner, thank you!
>>
>>> From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Mon, 25 Sep 2023 16:55:18 -0400
>>> Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
>>>  pages.
>>>
>>> alloc_contig_range() uses set_migrateype_isolate(), which eventually calls
>>> move_freepages(), to isolate free pages. But move_freepages() was not able
>>> to move free pages partially covered by the specified range, leaving a race
>>> window open[1]. Fix it by teaching move_freepages() to split a free page
>>> when only part of it is going to be moved.
>>>
>>> In addition, when a >pageblock_order free page is moved, only its first
>>> pageblock's migratetype is changed, which can cause warnings later. Fix it by
>>> setting all pageblocks in a free page to the same migratetype after the move.
>>>
>>> split_free_page() is changed to be used in move_freepages() and
>>> isolate_single_pageblock(). Common code to find the start pfn of a free
>>> page is refactored into get_freepage_start_pfn().
>>>
>>> [1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
>>>  mm/page_isolation.c | 17 +++++++---
>>>  2 files changed, 73 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 7c41cb5d8a36..3fd5ab40b55c 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
>>>  	struct zone *zone = page_zone(free_page);
>>>  	unsigned long free_page_pfn = page_to_pfn(free_page);
>>>  	unsigned long pfn;
>>> -	unsigned long flags;
>>>  	int free_page_order;
>>>  	int mt;
>>>  	int ret = 0;
>>>
>>> -	if (split_pfn_offset == 0)
>>> -		return ret;
>>> +	/* zone lock should be held when this function is called */
>>> +	lockdep_assert_held(&zone->lock);
>>>
>>> -	spin_lock_irqsave(&zone->lock, flags);
>>> +	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
>>> +		return ret;
>>>
>>>  	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
>>>  		ret = -ENOENT;
>>> @@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
>>>  			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>>>  	}
>>>  out:
>>> -	spin_unlock_irqrestore(&zone->lock, flags);
>>>  	return ret;
>>>  }
>>>  /*
>>> @@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>>>  					unsigned int order) { return NULL; }
>>>  #endif
>>>
>>> +/*
>>> + * Get the first pfn of the free page that the given pfn falls within. If
>>> + * no such free page exists, return the given pfn.
>>> + */
>>> +static unsigned long get_freepage_start_pfn(unsigned long pfn)
>>> +{
>>> +	int order = 0;
>>> +	unsigned long start_pfn = pfn;
>>> +
>>> +	while (!PageBuddy(pfn_to_page(start_pfn))) {
>>> +		if (++order > MAX_ORDER) {
>>> +			start_pfn = pfn;
>>> +			break;
>>> +		}
>>> +		start_pfn &= ~0UL << order;
>>> +	}
>>> +	return start_pfn;
>>> +}
>>> +
>>>  /*
>>>   * Move the free pages in a range to the freelist tail of the requested type.
>>>   * Note that start_page and end_pages are not aligned on a pageblock
>>> @@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>>  {
>>>  	struct page *page;
>>> -	unsigned long pfn;
>>> +	unsigned long pfn, pfn2;
>>>  	unsigned int order;
>>>  	int pages_moved = 0;
>>> +	unsigned long mt_change_pfn = start_pfn;
>>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>>> +
>>> +	/* split at start_pfn if it is in the middle of a free page */
>>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>>> +		int new_page_order = buddy_order(new_page);
>>> +
>>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>>> +			/* change migratetype so that split_free_page can work */
>>> +			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> +			split_free_page(new_page, buddy_order(new_page),
>>> +					start_pfn - new_start_pfn);
>>> +
>>> +			mt_change_pfn = start_pfn;
>>> +			/* move to next page */
>>> +			start_pfn = new_start_pfn + (1 << new_page_order);
>>> +		}
>>> +	}
>>
>> Ok, so if there is a straddle from the previous block into our block
>> of interest, it's split and the migratetype is set only on our block.
>
> Correct. For example, start_pfn is 0x200 (2MB) and the free page starting from 0x0 is order-10 (4MB).
>
>>
>>> @@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>
>>>  		order = buddy_order(page);
>>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>>> +		/*
>>> +		 * set page migratetype for all pageblocks within the page and
>>> +		 * only after we move all free pages in one pageblock
>>> +		 */
>>> +		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
>>> +			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
>>> +			     pfn2 += pageblock_nr_pages) {
>>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>>> +							  new_mt);
>>> +				mt_change_pfn = pfn2;
>>> +			}
>>
>> But if we have the first block of a MAX_ORDER chunk, then we don't
>> split but rather move the whole chunk and make sure to update the
>> chunk's blocks that are outside the range of interest.
>>
>> It looks like either way would work, but why not split here as well
>> and keep the move contained to the block? Wouldn't this be a bit more
>> predictable and easier to understand?
>
> Yes, having a split here would be consistent.
>
> Also I want to spell out the corner case I am handling here (and I will add
> it to the comment): move_to_free_list() checks the page's migratetype
> against old_mt, and changing one page's migratetype affects all pages within
> the same pageblock. So if we are moving more than one free page in the
> same pageblock, setting the migratetype right after the first
> move_to_free_list() triggers the warning in the following one.
>
>>> +		}
>>>  		pfn += 1 << order;
>>>  		pages_moved += 1 << order;
>>>  	}
>>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> +	/* set migratetype for the remaining pageblocks */
>>> +	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
>>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>>
>> I think I'm missing something for this.
>>
>> - If there was no straddle, there is only our block of interest to
>>   update.
>>
>> - If there was a straddle from the previous block, it was split and
>>   the block of interest was already updated. Nothing to do here?
>>
>> - If there was a straddle into the next block, both blocks are updated
>>   to the new type. Nothing to do here?
>>
>> What's the case where there are multiple blocks to update in the end?
>
> When a pageblock has free pages at the beginning and in-use pages at the end.
> The pageblock migratetype is not changed in the for loop above, since free
> pages do not cross a pageblock boundary. But these free pages are moved
> to a new mt free list and will trigger warnings later.
>
> Also if multiple pageblocks are filled with only in-use pages, the for loop
> does nothing either. Their pageblocks will be set at this point. I notice
> as I am writing that this might be a change of behavior, but the change
> might be for the better. Before, a pageblock's migratetype might or might
> not be changed, depending on whether there is a free page in that pageblock,
> meaning there could be migratetype holes in the specified range. Now the
> whole range is changed to new_mt. Let me know if you have a different opinion.
>
>
>>> @@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>  			int order = buddy_order(page);
>>>
>>>  			if (pfn + (1UL << order) > boundary_pfn) {
>>> +				int res;
>>> +				unsigned long flags;
>>> +
>>> +				spin_lock_irqsave(&zone->lock, flags);
>>> +				res = split_free_page(page, order, boundary_pfn - pfn);
>>> +				spin_unlock_irqrestore(&zone->lock, flags);
>>> +
>>>  				/* free page changed before split, check it again */
>>> -				if (split_free_page(page, order, boundary_pfn - pfn))
>>> +				if (res)
>>>  					continue;
>>
>> At this point, we've already set the migratetype, which has handled
>> straddling free pages. Is this split still needed?
>
> Good point. I will remove it. Originally, I thought it should stay to handle
> free pages coming from the migration below. But unless a greater than
> pageblock order in-use page shows up in the system and is freed directly
> via __free_pages(), any free page coming from the migration below should
> be put on the right free list.
>
> Such > pageblock order pages are possible only if we have >PMD order THPs
> or __PageMovable pages of that size. IIRC, neither exists yet.
>
>>
>>> @@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>  				/*
>>>  				 * XXX: mark the page as MIGRATE_ISOLATE so that
>>>  				 * no one else can grab the freed page after migration.
>>> -				 * Ideally, the page should be freed as two separate
>>> -				 * pages to be added into separate migratetype free
>>> -				 * lists.
>>> +				 * The page should be freed into separate migratetype
>>> +				 * free lists, unless the free page order is greater
>>> +				 * than pageblock order. It is not the case now,
>>> +				 * since gigantic hugetlb is freed as order-0
>>> +				 * pages and LRU pages do not cross pageblocks.
>>>  				 */
>>>  				if (isolate_page) {
>>>  					ret = set_migratetype_isolate(page, page_mt,
>>
>> I hadn't thought about LRU pages being constrained to single
>> pageblocks before. Does this mean we only ever migrate here in case
>
> Initially, I thought a lot about what happens if a high order folio crosses
> two adjacent pageblocks, but in the end I found that __find_buddy_pfn()
> does not treat pfns from adjacent pageblocks as buddies unless the order
> is greater than pageblock order. So no high order folio from the
> buddy allocator crosses pageblocks. That is a relief.
>
> Another (future) possibility is that once anon large folios are merged and
> my split-huge-page-to-any-lower-order patches are merged, a high order
> folio might come not directly from the buddy allocator but from a huge page
> split. But that requires a > pageblock order folio to exist first, which
> is not possible either. So we are good.
>
>> there is a movable gigantic page? And since those are already split
>> during the free, does that mean the "reset pfn to head of the free
>> page" part after the migration is actually unnecessary?
>
> Yes. the "reset pfn" code could be removed.
>
> Thank you for the review. Really appreciate it. Let me revise my
> patch 3 and send it out again.

It turns out that there was a bug in my patch 2: set_pageblock_migratetype()
is used by the isolated_page case too, so it cannot be removed unconditionally.

I attached my revised patch 2 and 3 (with all the suggestions above).


--
Best Regards,
Yan, Zi

[-- Attachment #1.2: v2-0001-mm-set-migratetype-after-free-pages-are-moved-bet.patch --]
[-- Type: text/plain, Size: 3462 bytes --]

From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:27:14 -0400
Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
 free lists.

This avoids changing the migratetype after move_freepages() or
move_freepages_block(), which is error prone. It also prepares for upcoming
changes that fix move_freepages() not moving free pages only partially
covered by the specified range.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c     | 10 +++-------
 mm/page_isolation.c |  7 +++----
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d839311d7c6e..928bb595d7cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1617,6 +1617,7 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		pfn += 1 << order;
 		pages_moved += 1 << order;
 	}
+	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
 
 	return pages_moved;
 }
@@ -1838,7 +1839,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
 		move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
-		set_pageblock_migratetype(page, start_type);
 		block_type = start_type;
 	}
 
@@ -1910,7 +1910,6 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	if (migratetype_is_mergeable(mt)) {
 		if (move_freepages_block(zone, page,
 					 mt, MIGRATE_HIGHATOMIC) != -1) {
-			set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 			zone->nr_reserved_highatomic += pageblock_nr_pages;
 		}
 	}
@@ -1995,7 +1994,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
-			set_pageblock_migratetype(page, ac->migratetype);
 			if (ret > 0) {
 				spin_unlock_irqrestore(&zone->lock, flags);
 				return ret;
@@ -2607,10 +2605,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 			 * Only change normal pageblocks (i.e., they can merge
 			 * with others)
 			 */
-			if (migratetype_is_mergeable(mt) &&
-			    move_freepages_block(zone, page, mt,
-						 MIGRATE_MOVABLE) != -1)
-				set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+			if (migratetype_is_mergeable(mt))
+			    move_freepages_block(zone, page, mt, MIGRATE_MOVABLE);
 		}
 	}
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..5f8c658c0853 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -187,7 +187,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
-		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -261,10 +260,10 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 		 * should not fail on zone boundaries.
 		 */
 		WARN_ON_ONCE(nr_pages == -1);
-	}
-	set_pageblock_migratetype(page, migratetype);
-	if (isolated_page)
+	} else {
+		set_pageblock_migratetype(page, migratetype);
 		__putback_isolated_page(page, order, migratetype);
+	}
 	zone->nr_isolate_pageblock--;
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
-- 
2.40.1


[-- Attachment #1.3: v2-0002-mm-enable-move_freepages-to-properly-move-part-of.patch --]
[-- Type: text/plain, Size: 9047 bytes --]

From 1734bb24a38052f13e3f2ddb26b82aa043638c95 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH v2 2/2] mm: enable move_freepages() to properly move part of
 free pages.

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages partially covered by the specified range, leaving a race
window open[1]. Fix it by teaching move_freepages() to split a free page
when only part of it is going to be moved.

In addition, when a >pageblock_order free page is moved, only its first
pageblock migratetype is changed. It can cause warnings later. Fix it by
setting all pageblocks in a free page to the same migratetype after the move.

split_free_page() is changed to be used in move_freepages() and
isolate_single_pageblock(). The common code to find the start pfn of a free
page is refactored into get_freepage_start_pfn().

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c     | 94 +++++++++++++++++++++++++++++++++++++--------
 mm/page_isolation.c | 38 +++++-------------
 2 files changed, 88 insertions(+), 44 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 928bb595d7cc..a86025f5e80a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,15 +865,15 @@ int split_free_page(struct page *free_page,
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
-	unsigned long flags;
 	int free_page_order;
 	int mt;
 	int ret = 0;
 
-	if (split_pfn_offset == 0)
-		return ret;
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+		return ret;
 
 	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
 		ret = -ENOENT;
@@ -899,7 +899,6 @@ int split_free_page(struct page *free_page,
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
 /*
@@ -1588,6 +1587,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+/*
+ * Get the first pfn of the free page that the given pfn falls within. If
+ * no such free page exists, return the given pfn.
+ */
+static unsigned long get_freepage_start_pfn(unsigned long pfn)
+{
+	int order = 0;
+	unsigned long start_pfn = pfn;
+
+	while (!PageBuddy(pfn_to_page(start_pfn))) {
+		if (++order > MAX_ORDER) {
+			start_pfn = pfn;
+			break;
+		}
+		start_pfn &= ~0UL << order;
+	}
+	return start_pfn;
+}
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 			  unsigned long end_pfn, int old_mt, int new_mt)
 {
 	struct page *page;
-	unsigned long pfn;
+	unsigned long pfn, pfn2;
 	unsigned int order;
 	int pages_moved = 0;
+	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
+	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
+
+	/* split at start_pfn if it is in the middle of a free page */
+	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
+		struct page *new_page = pfn_to_page(new_start_pfn);
+		int new_page_order = buddy_order(new_page);
+
+		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
+			/* change migratetype so that split_free_page can work */
+			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+			split_free_page(new_page, buddy_order(new_page),
+					start_pfn - new_start_pfn);
+
+			mt_changed_pfn = start_pfn;
+			/* move to next page */
+			start_pfn = new_start_pfn + (1 << new_page_order);
+		}
+	}
+
 
 	for (pfn = start_pfn; pfn <= end_pfn;) {
 		page = pfn_to_page(pfn);
@@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 
 		order = buddy_order(page);
 		move_to_free_list(page, zone, order, old_mt, new_mt);
+		/*
+		 * set page migratetype 1) only after we move all free pages in
+		 * one pageblock and 2) for all pageblocks within the page.
+		 *
+		 * for 1), since move_to_free_list() checks page migratetype with
+		 * old_mt and changing one page migratetype affects all pages
+		 * within the same pageblock, if we are moving more than
+		 * one free pages in the same pageblock, setting migratetype
+		 * right after first move_to_free_list() triggers the warning
+		 * in the following move_to_free_list().
+		 *
+		 * for 2), when a free page order is greater than pageblock_order,
+		 * all pageblocks within the free page need to be changed after
+		 * move_to_free_list().
+		 */
+		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
+			for (pfn2 = pfn;
+			     pfn2 < min_t(unsigned long,
+					  pfn + (1 << order),
+					  end_pfn + 1);
+			     pfn2 += pageblock_nr_pages) {
+				set_pageblock_migratetype(pfn_to_page(pfn2),
+							  new_mt);
+				mt_changed_pfn = pfn2;
+			}
+			/* split the free page if it goes beyond the specified range */
+			if (pfn + (1 << order) > (end_pfn + 1))
+				split_free_page(page, order, end_pfn + 1 - pfn);
+		}
 		pfn += 1 << order;
 		pages_moved += 1 << order;
 	}
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+	/* set migratetype for the remaining pageblocks */
+	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
+	     pfn2 <= end_pfn;
+	     pfn2 += pageblock_nr_pages)
+		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
 
 	return pages_moved;
 }
@@ -6213,14 +6284,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */
 
 	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
+	outer_start = get_freepage_start_pfn(start);
 
 	if (outer_start != start) {
 		order = buddy_order(pfn_to_page(outer_start));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5f8c658c0853..e053386f5e3a 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -380,11 +380,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			VM_WARN_ONCE(pfn + (1UL << order) > boundary_pfn,
+				"a free page sits across isolation boundary");
 
 			pfn += 1UL << order;
 			continue;
@@ -408,8 +405,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			 * can be migrated. Otherwise, fail the isolation.
 			 */
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
 				int page_mt = get_pageblock_migratetype(page);
 				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
@@ -427,9 +422,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				/*
 				 * XXX: mark the page as MIGRATE_ISOLATE so that
 				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
+				 * The page should be freed into separate migratetype
+				 * free lists, unless the free page order is greater
+				 * than pageblock order. It is not the case now,
+				 * since gigantic hugetlb is freed as order-0
+				 * pages and LRU pages do not cross pageblocks.
 				 */
 				if (isolate_page) {
 					ret = set_migratetype_isolate(page, page_mt,
@@ -451,25 +448,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
-- 
2.40.1


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-02 11:43                                         ` David Hildenbrand
@ 2023-10-03  2:35                                           ` Zi Yan
  0 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-10-03  2:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4623 bytes --]

On 2 Oct 2023, at 7:43, David Hildenbrand wrote:

>>>> I can do it after I fix this. That change might help only if we
>>>> redesign how migratetype is managed. If MIGRATE_ISOLATE does not
>>>> overwrite the existing migratetype, the code might not need to split a
>>>> page and move it to the MIGRATE_ISOLATE freelist?
>>>
>>> Did someone test how memory offlining plays along with that? (I can try myself
>>> within the next 1-2 weeks)
>>>
>>> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
>>> though.
>>>
>>> ret = start_isolate_page_range(start_pfn, end_pfn,
>>> 			       MIGRATE_MOVABLE,
>>> 			       MEMORY_OFFLINE | REPORT_FAILURE,
>>> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
>>
>> Since a full MAX_ORDER range is passed, no free page split will happen.
>
> Okay, thanks for verifying that it should not be affected!
>
>>
>>>
>>>>
>>>> The fundamental issue in alloc_contig_range() is that to work at
>>>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>>> now checks first pageblock migratetype, so such a page needs to be removed
>>>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>>>> finally put back to multiple free lists. This needs to be done at isolation stage
>>>> before free pages are removed from their free lists (the stage after isolation).
>>>
>>> One idea was to always isolate larger chunks, and handle movability checks/split/etc
>>> at a later stage. Once isolation would be decoupled from the actual/original migratetype,
>>> they could have been easier to handle (especially some corner cases I had in mind back then).
>>
>> I think it is a good idea. When I coded alloc_contig_range() up, I tried to
>> accommodate existing set_migratetype_isolate(), which calls has_unmovable_pages().
>> If these two are decoupled, set_migrateype_isolate() can work on MAX_ORDER-aligned
>> ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
>> Let me give this a try.
>>
>
> But again, just some thought I had back then, maybe it doesn't help for anything; I never found more time to look into the whole thing in more detail.

Sure. The devil is in the details, but I will only know the details and what works
after I code it up. :)

>>>
>>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>>> in their original migratetype, and we check the migratetype before allocating a page,
>>>> that might help. But that might add extra work (e.g., splitting a partially
>>>> isolated free page before allocation) in the really hot code path, which is not
>>>> desirable.
>>>
>>> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
>>> separate isolate list, but one per "proper migratetype". But again, just some random
>>> thoughts I had back then, I never had sufficient time to think it all through.
>>
>> Got it. I will think about it.
>>
>> One question on separate MIGRATE_ISOLATE:
>>
>> the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
>> bit instead of being one of migratetype. But now there are 5 migratetypes +
>
> Exactly what I was concerned about back then ...
>
>> MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype bit is needed.
>> But the current migratetype implementation uses word-based operations, requiring
>> NR_PAGEBLOCK_BITS to be a divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
>> needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.
>
> ... until I did the math. Let's assume a pageblock is 2 MiB.
>
> 4 / (2 * 1024 * 1024 * 8) = 0.00002384185791016 %
>
> 8 / (2 * 1024 * 1024 * 8) -> 1 / (2 * 1024 * 1024) = 0.00004768371582031 %
>
> For a 1 TiB machine that means 256 KiB vs. 512 KiB
>
> I concluded that "wasting a lot of space" is not really the right word to describe that :)
>
> Just to put it into perspective, the memmap (64/4096) for a 1 TiB machine is ... 16 GiB.

You are right. I should have done the math. The absolute increase is not much.

>> An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
>> additional changes. Let me know if you have a better idea. Thanks.
>
> It would probably be cleanest to just use one byte per pageblock. That would clean up the whole machinery eventually as well.

Let me give this a try and see if it cleans things up.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-03  2:26                                           ` Zi Yan
@ 2023-10-10 21:12                                             ` Johannes Weiner
  2023-10-11 15:25                                               ` Johannes Weiner
  2023-10-13  0:06                                               ` Zi Yan
  0 siblings, 2 replies; 83+ messages in thread
From: Johannes Weiner @ 2023-10-10 21:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

Hello!

On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> On 27 Sep 2023, at 22:51, Zi Yan wrote:
> I attached my revised patch 2 and 3 (with all the suggestions above).

Thanks! It took me a bit to read through them. It's a tricky codebase!

Some comments below.

> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Mon, 25 Sep 2023 16:27:14 -0400
> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>  free lists.
> 
> This avoids changing the migratetype after move_freepages() or
> move_freepages_block(), which is error prone. It also prepares for upcoming
> changes that fix move_freepages() not moving free pages only partially
> covered by the specified range.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

This is great and indeed makes the callsites much simpler. Thanks,
I'll fold this into the series.

> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  			  unsigned long end_pfn, int old_mt, int new_mt)
>  {
>  	struct page *page;
> -	unsigned long pfn;
> +	unsigned long pfn, pfn2;
>  	unsigned int order;
>  	int pages_moved = 0;
> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
> +
> +	/* split at start_pfn if it is in the middle of a free page */
> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
> +		struct page *new_page = pfn_to_page(new_start_pfn);
> +		int new_page_order = buddy_order(new_page);

get_freepage_start_pfn() returns start_pfn if it didn't find a large
buddy, so the buddy check shouldn't be necessary, right?

> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {

This *should* be implied according to the comments on
get_freepage_start_pfn(), but it currently isn't. Doing so would help
here, and seemingly also in alloc_contig_range().

How about this version of get_freepage_start_pfn()?

/*
 * Scan the range before this pfn for a buddy that straddles it
 */
static unsigned long find_straddling_buddy(unsigned long start_pfn)
{
	int order = 0;
	struct page *page;
	unsigned long pfn = start_pfn;

	while (!PageBuddy(page = pfn_to_page(pfn))) {
		/* Nothing found */
		if (++order > MAX_ORDER)
			return start_pfn;
		pfn &= ~0UL << order;
	}

	/*
	 * Found a preceding buddy, but does it straddle?
	 */
	if (pfn + (1 << buddy_order(page)) > start_pfn)
		return pfn;

	/* Nothing found */
	return start_pfn;
}

> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  
>  		order = buddy_order(page);
>  		move_to_free_list(page, zone, order, old_mt, new_mt);
> +		/*
> +		 * set page migratetype 1) only after we move all free pages in
> +		 * one pageblock and 2) for all pageblocks within the page.
> +		 *
> +		 * for 1), since move_to_free_list() checks page migratetype with
> +		 * old_mt and changing one page migratetype affects all pages
> +		 * within the same pageblock, if we are moving more than
> +		 * one free pages in the same pageblock, setting migratetype
> +		 * right after first move_to_free_list() triggers the warning
> +		 * in the following move_to_free_list().
> +		 *
> +		 * for 2), when a free page order is greater than pageblock_order,
> +		 * all pageblocks within the free page need to be changed after
> +		 * move_to_free_list().

I think this can be somewhat simplified.

There are two assumptions we can make. Buddies always consist of 2^n
pages. And buddies and pageblocks are naturally aligned. This means
that if this pageblock has the start of a buddy that straddles into
the next pageblock(s), it must be the first page in the block. That in
turn means we can move the handling before the loop.

If we split first, it also makes the loop a little simpler because we
know that any buddies that start inside this block cannot extend
beyond it (due to the alignment). The loop as it was originally
written can remain untouched.

> +		 */
> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
> +			for (pfn2 = pfn;
> +			     pfn2 < min_t(unsigned long,
> +					  pfn + (1 << order),
> +					  end_pfn + 1);
> +			     pfn2 += pageblock_nr_pages) {
> +				set_pageblock_migratetype(pfn_to_page(pfn2),
> +							  new_mt);
> +				mt_changed_pfn = pfn2;

Hm, this seems to assume that start_pfn to end_pfn can be more than
one block. Why is that? This function is only used on single blocks.

> +			}
> +			/* split the free page if it goes beyond the specified range */
> +			if (pfn + (1 << order) > (end_pfn + 1))
> +				split_free_page(page, order, end_pfn + 1 - pfn);
> +		}
>  		pfn += 1 << order;
>  		pages_moved += 1 << order;
>  	}
> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +	/* set migratetype for the remaining pageblocks */
> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
> +	     pfn2 <= end_pfn;
> +	     pfn2 += pageblock_nr_pages)
> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);

If I rework the code on the above, I'm arriving at the following:

static int move_freepages(struct zone *zone, unsigned long start_pfn,
			  unsigned long end_pfn, int old_mt, int new_mt)
{
	struct page *start_page = pfn_to_page(start_pfn);
	int pages_moved = 0;
	unsigned long pfn;

	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);

	/*
	 * A free page may be comprised of 2^n blocks, which means our
	 * block of interest could be head or tail in such a page.
	 *
	 * If we're a tail, update the type of our block, then split
	 * the page into pageblocks. The splitting will do the leg
	 * work of sorting the blocks into the right freelists.
	 *
	 * If we're a head, split the page into pageblocks first. This
	 * ensures the migratetypes still match up during the freelist
	 * removal. Then do the regular scan for buddies in the block
	 * of interest, which will handle the rest.
	 *
	 * In theory, we could try to preserve 2^1 and larger blocks
	 * that lie outside our range. In practice, MAX_ORDER is
	 * usually one or two pageblocks anyway, so don't bother.
	 *
	 * Note that this only applies to page isolation, which calls
	 * this on random blocks in the pfn range! When we move stuff
	 * from inside the page allocator, the pages are coming off
	 * the freelist (can't be tail) and multi-block pages are
	 * handled directly in the stealing code (can't be a head).
	 */

	/* We're a tail */
	pfn = find_straddling_buddy(start_pfn);
	if (pfn != start_pfn) {
		struct page *free_page = pfn_to_page(pfn);

		set_pageblock_migratetype(start_page, new_mt);
		split_free_page(free_page, buddy_order(free_page),
				pageblock_nr_pages);
		return pageblock_nr_pages;
	}

	/* We're a head */
	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
		split_free_page(start_page, buddy_order(start_page),
				pageblock_nr_pages);

	/* Move buddies within the block */
	while (pfn <= end_pfn) {
		struct page *page = pfn_to_page(pfn);
		int order, nr_pages;

		if (!PageBuddy(page)) {
			pfn++;
			continue;
		}

		/* Make sure we are not inadvertently changing nodes */
		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
		VM_BUG_ON_PAGE(page_zone(page) != zone, page);

		order = buddy_order(page);
		nr_pages = 1 << order;

		move_to_free_list(page, zone, order, old_mt, new_mt);

		pfn += nr_pages;
		pages_moved += nr_pages;
	}

	set_pageblock_migratetype(start_page, new_mt);

	return pages_moved;
}

Does this look reasonable to you?

Note that the page isolation specific stuff comes first. If this code
holds up, we should be able to move it to page-isolation.c and keep it
out of the regular allocator path.

Thanks!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-10 21:12                                             ` Johannes Weiner
@ 2023-10-11 15:25                                               ` Johannes Weiner
  2023-10-11 15:45                                                 ` Johannes Weiner
  2023-10-13  0:06                                               ` Zi Yan
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-10-11 15:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Tue, Oct 10, 2023 at 05:12:01PM -0400, Johannes Weiner wrote:
> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> > @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
> >  
> >  		order = buddy_order(page);
> >  		move_to_free_list(page, zone, order, old_mt, new_mt);
> > +		/*
> > +		 * set page migratetype 1) only after we move all free pages in
> > +		 * one pageblock and 2) for all pageblocks within the page.
> > +		 *
> > +		 * for 1), since move_to_free_list() checks page migratetype with
> > +		 * old_mt and changing one page migratetype affects all pages
> > +		 * within the same pageblock, if we are moving more than
> > +		 * one free pages in the same pageblock, setting migratetype
> > +		 * right after first move_to_free_list() triggers the warning
> > +		 * in the following move_to_free_list().
> > +		 *
> > +		 * for 2), when a free page order is greater than pageblock_order,
> > +		 * all pageblocks within the free page need to be changed after
> > +		 * move_to_free_list().
> 
> I think this can be somewhat simplified.
> 
> There are two assumptions we can make. Buddies always consist of 2^n
> pages. And buddies and pageblocks are naturally aligned. This means
> that if this pageblock has the start of a buddy that straddles into
> the next pageblock(s), it must be the first page in the block. That in
> turn means we can move the handling before the loop.

Eh, scratch that. Obviously, a sub-block buddy can straddle blocks :(

So forget about my version of move_freepages(). Only consider the
changes to find_straddling_buddy() and my question about multiple
blocks inside the requested range.

But I do have another question about your patch then. Say you have an
order-1 buddy that straddles into the block:

+       /* split at start_pfn if it is in the middle of a free page */
+       if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
+               struct page *new_page = pfn_to_page(new_start_pfn);
+               int new_page_order = buddy_order(new_page);
+
+               if (new_start_pfn + (1 << new_page_order) > start_pfn) {
+                       /* change migratetype so that split_free_page can work */
+                       set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+                       split_free_page(new_page, buddy_order(new_page),
+                                       start_pfn - new_start_pfn);
+
+                       mt_changed_pfn = start_pfn;
+                       /* move to next page */
+                       start_pfn = new_start_pfn + (1 << new_page_order);
+               }
+       }

this will have changed the type of the block to new_mt.

But then the buddy scan will do this:

                move_to_free_list(page, zone, order, old_mt, new_mt);
+               /*
+                * set page migratetype 1) only after we move all free pages in
+                * one pageblock and 2) for all pageblocks within the page.
+                *
+                * for 1), since move_to_free_list() checks page migratetype with
+                * old_mt and changing one page migratetype affects all pages
+                * within the same pageblock, if we are moving more than
+                * one free pages in the same pageblock, setting migratetype
+                * right after first move_to_free_list() triggers the warning
+                * in the following move_to_free_list().
+                *
+                * for 2), when a free page order is greater than pageblock_order,
+                * all pageblocks within the free page need to be changed after
+                * move_to_free_list().

That move_to_free_list() will complain that the pages no longer match
old_mt, no?


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-11 15:25                                               ` Johannes Weiner
@ 2023-10-11 15:45                                                 ` Johannes Weiner
  2023-10-11 15:57                                                   ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-10-11 15:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Wed, Oct 11, 2023 at 11:25:27AM -0400, Johannes Weiner wrote:
> On Tue, Oct 10, 2023 at 05:12:01PM -0400, Johannes Weiner wrote:
> > On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> > > @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
> > >  
> > >  		order = buddy_order(page);
> > >  		move_to_free_list(page, zone, order, old_mt, new_mt);
> > > +		/*
> > > +		 * set page migratetype 1) only after we move all free pages in
> > > +		 * one pageblock and 2) for all pageblocks within the page.
> > > +		 *
> > > +		 * for 1), since move_to_free_list() checks page migratetype with
> > > +		 * old_mt and changing one page migratetype affects all pages
> > > +		 * within the same pageblock, if we are moving more than
> > > +		 * one free pages in the same pageblock, setting migratetype
> > > +		 * right after first move_to_free_list() triggers the warning
> > > +		 * in the following move_to_free_list().
> > > +		 *
> > > +		 * for 2), when a free page order is greater than pageblock_order,
> > > +		 * all pageblocks within the free page need to be changed after
> > > +		 * move_to_free_list().
> > 
> > I think this can be somewhat simplified.
> > 
> > There are two assumptions we can make. Buddies always consist of 2^n
> > pages. And buddies and pageblocks are naturally aligned. This means
> > that if this pageblock has the start of a buddy that straddles into
> > the next pageblock(s), it must be the first page in the block. That in
> > turn means we can move the handling before the loop.
> 
> Eh, scratch that. Obviously, a sub-block buddy can straddle blocks :(

I apologize for the back and forth, but I think I had it right the
first time. Say we have order-0 frees at pfn 511 and 512. Those can't
merge because their order-0 buddies are 510 and 513 respectively. The
same keeps higher-order merges below block size within the pageblock.
So again, due to the pow2 alignment, the only way for a buddy to
straddle a pageblock boundary is if it's >pageblock_order.

Please double check me on this, because I've stared at your patches
and the allocator code long enough now to thoroughly confuse myself.

My proposal for the follow-up changes still stands for now.


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-11 15:45                                                 ` Johannes Weiner
@ 2023-10-11 15:57                                                   ` Zi Yan
  0 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-10-11 15:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


On 11 Oct 2023, at 11:45, Johannes Weiner wrote:

> On Wed, Oct 11, 2023 at 11:25:27AM -0400, Johannes Weiner wrote:
>> On Tue, Oct 10, 2023 at 05:12:01PM -0400, Johannes Weiner wrote:
>>> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>>>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>>
>>>>  		order = buddy_order(page);
>>>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>>>> +		/*
>>>> +		 * set page migratetype 1) only after we move all free pages in
>>>> +		 * one pageblock and 2) for all pageblocks within the page.
>>>> +		 *
>>>> +		 * for 1), since move_to_free_list() checks page migratetype with
>>>> +		 * old_mt and changing one page migratetype affects all pages
>>>> +		 * within the same pageblock, if we are moving more than
>>>> +		 * one free pages in the same pageblock, setting migratetype
>>>> +		 * right after first move_to_free_list() triggers the warning
>>>> +		 * in the following move_to_free_list().
>>>> +		 *
>>>> +		 * for 2), when a free page order is greater than pageblock_order,
>>>> +		 * all pageblocks within the free page need to be changed after
>>>> +		 * move_to_free_list().
>>>
>>> I think this can be somewhat simplified.
>>>
>>> There are two assumptions we can make. Buddies always consist of 2^n
>>> pages. And buddies and pageblocks are naturally aligned. This means
>>> that if this pageblock has the start of a buddy that straddles into
>>> the next pageblock(s), it must be the first page in the block. That in
>>> turn means we can move the handling before the loop.
>>
>> Eh, scratch that. Obviously, a sub-block buddy can straddle blocks :(
>
> I apologize for the back and forth, but I think I had it right the
> first time. Say we have order-0 frees at pfn 511 and 512. Those can't
> merge because their order-0 buddies are 510 and 513 respectively. The
> same keeps higher-order merges below block size within the pageblock.
> So again, due to the pow2 alignment, the only way for a buddy to
> straddle a pageblock boundary is if it's >pageblock_order.
>
> Please double check me on this, because I've stared at your patches
> and the allocator code long enough now to thoroughly confuse myself.
>
> My proposal for the follow-up changes still stands for now.

Sure. I admit that the current alloc_contig_range() code is too complicated
and I am going to refactor it.

find_straddling_buddy() looks good to me. You will need this change in
alloc_contig_range() to replace get_freepage_start_pfn():

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a86025f5e80a..e8ed25c94863 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,7 +6209,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
                       unsigned migratetype, gfp_t gfp_mask)
 {
        unsigned long outer_start, outer_end;
-       int order;
        int ret = 0;

        struct compact_control cc = {
@@ -6283,21 +6282,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
         * isolated thus they won't get removed from buddy.
         */

-       order = 0;
-       outer_start = get_freepage_start_pfn(start);
-
-       if (outer_start != start) {
-               order = buddy_order(pfn_to_page(outer_start));
-
-               /*
-                * outer_start page could be small order buddy page and
-                * it doesn't include start page. Adjust outer_start
-                * in this case to report failed page properly
-                * on tracepoint in test_pages_isolated()
-                */
-               if (outer_start + (1UL << order) <= start)
-                       outer_start = start;
-       }
+       /*
+        * outer_start page could be small order buddy page and it doesn't
+        * include start page. outer_start is set to start in
+        * find_straddling_buddy() to report failed page properly on tracepoint
+        * in test_pages_isolated()
+        */
+       outer_start = find_straddling_buddy(start);

        /* Make sure the range is really isolated. */
        if (test_pages_isolated(outer_start, end, 0)) {

Let me go through your move_freepages() in details and get back to you.

Thank you for the feedback!

--
Best Regards,
Yan, Zi



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-10 21:12                                             ` Johannes Weiner
  2023-10-11 15:25                                               ` Johannes Weiner
@ 2023-10-13  0:06                                               ` Zi Yan
  2023-10-13 14:51                                                 ` Zi Yan
  1 sibling, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-10-13  0:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


On 10 Oct 2023, at 17:12, Johannes Weiner wrote:

> Hello!
>
> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>> On 27 Sep 2023, at 22:51, Zi Yan wrote:
>> I attached my revised patch 2 and 3 (with all the suggestions above).
>
> Thanks! It took me a bit to read through them. It's a tricky codebase!
>
> Some comments below.
>
>> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>>  free lists.
>>
>> This avoids changing migratetype after move_freepages() or
>> move_freepages_block(), which is error prone. It also prepares for upcoming
>> changes to fix move_freepages() not moving free pages partially in the
>> range.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> This is great and indeed makes the callsites much simpler. Thanks,
> I'll fold this into the series.
>
>> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>  {
>>  	struct page *page;
>> -	unsigned long pfn;
>> +	unsigned long pfn, pfn2;
>>  	unsigned int order;
>>  	int pages_moved = 0;
>> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>> +
>> +	/* split at start_pfn if it is in the middle of a free page */
>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>> +		int new_page_order = buddy_order(new_page);
>
> get_freepage_start_pfn() returns start_pfn if it didn't find a large
> buddy, so the buddy check shouldn't be necessary, right?
>
>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>
> This *should* be implied according to the comments on
> get_freepage_start_pfn(), but it currently isn't. Doing so would help
> here, and seemingly also in alloc_contig_range().
>
> How about this version of get_freepage_start_pfn()?
>
> /*
>  * Scan the range before this pfn for a buddy that straddles it
>  */
> static unsigned long find_straddling_buddy(unsigned long start_pfn)
> {
> 	int order = 0;
> 	struct page *page;
> 	unsigned long pfn = start_pfn;
>
> 	while (!PageBuddy(page = pfn_to_page(pfn))) {
> 		/* Nothing found */
> 		if (++order > MAX_ORDER)
> 			return start_pfn;
> 		pfn &= ~0UL << order;
> 	}
>
> 	/*
> 	 * Found a preceding buddy, but does it straddle?
> 	 */
> 	if (pfn + (1 << buddy_order(page)) > start_pfn)
> 		return pfn;
>
> 	/* Nothing found */
> 	return start_pfn;
> }
>
>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>
>>  		order = buddy_order(page);
>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>> +		/*
>> +		 * set page migratetype 1) only after we move all free pages in
>> +		 * one pageblock and 2) for all pageblocks within the page.
>> +		 *
>> +		 * for 1), since move_to_free_list() checks page migratetype with
>> +		 * old_mt and changing one page migratetype affects all pages
>> +		 * within the same pageblock, if we are moving more than
>> +		 * one free pages in the same pageblock, setting migratetype
>> +		 * right after first move_to_free_list() triggers the warning
>> +		 * in the following move_to_free_list().
>> +		 *
>> +		 * for 2), when a free page order is greater than pageblock_order,
>> +		 * all pageblocks within the free page need to be changed after
>> +		 * move_to_free_list().
>
> I think this can be somewhat simplified.
>
> There are two assumptions we can make. Buddies always consist of 2^n
> pages. And buddies and pageblocks are naturally aligned. This means
> that if this pageblock has the start of a buddy that straddles into
> the next pageblock(s), it must be the first page in the block. That in
> turn means we can move the handling before the loop.
>
> If we split first, it also makes the loop a little simpler because we
> know that any buddies that start inside this block cannot extend
> beyond it (due to the alignment). The loop as it was originally
> written can remain untouched.
>
>> +		 */
>> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
>> +			for (pfn2 = pfn;
>> +			     pfn2 < min_t(unsigned long,
>> +					  pfn + (1 << order),
>> +					  end_pfn + 1);
>> +			     pfn2 += pageblock_nr_pages) {
>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>> +							  new_mt);
>> +				mt_changed_pfn = pfn2;
>
> Hm, this seems to assume that start_pfn to end_pfn can be more than
> one block. Why is that? This function is only used on single blocks.

You are right. I made unnecessary assumptions when I wrote the code.

>
>> +			}
>> +			/* split the free page if it goes beyond the specified range */
>> +			if (pfn + (1 << order) > (end_pfn + 1))
>> +				split_free_page(page, order, end_pfn + 1 - pfn);
>> +		}
>>  		pfn += 1 << order;
>>  		pages_moved += 1 << order;
>>  	}
>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>> +	/* set migratetype for the remaining pageblocks */
>> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
>> +	     pfn2 <= end_pfn;
>> +	     pfn2 += pageblock_nr_pages)
>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>
> If I rework the code on the above, I'm arriving at the following:
>
> static int move_freepages(struct zone *zone, unsigned long start_pfn,
> 			  unsigned long end_pfn, int old_mt, int new_mt)
> {
> 	struct page *start_page = pfn_to_page(start_pfn);
> 	int pages_moved = 0;
> 	unsigned long pfn;
>
> 	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
> 	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>
> 	/*
> 	 * A free page may be comprised of 2^n blocks, which means our
> 	 * block of interest could be head or tail in such a page.
> 	 *
> 	 * If we're a tail, update the type of our block, then split
> 	 * the page into pageblocks. The splitting will do the leg
> 	 * work of sorting the blocks into the right freelists.
> 	 *
> 	 * If we're a head, split the page into pageblocks first. This
> 	 * ensures the migratetypes still match up during the freelist
> 	 * removal. Then do the regular scan for buddies in the block
> 	 * of interest, which will handle the rest.
> 	 *
> 	 * In theory, we could try to preserve 2^1 and larger blocks
> 	 * that lie outside our range. In practice, MAX_ORDER is
> 	 * usually one or two pageblocks anyway, so don't bother.
> 	 *
> 	 * Note that this only applies to page isolation, which calls
> 	 * this on random blocks in the pfn range! When we move stuff
> 	 * from inside the page allocator, the pages are coming off
> 	 * the freelist (can't be tail) and multi-block pages are
> 	 * handled directly in the stealing code (can't be a head).
> 	 */
>
> 	/* We're a tail */
> 	pfn = find_straddling_buddy(start_pfn);
> 	if (pfn != start_pfn) {
> 		struct page *free_page = pfn_to_page(pfn);
>
> 		set_pageblock_migratetype(start_page, new_mt);
> 		split_free_page(free_page, buddy_order(free_page),
> 				pageblock_nr_pages);
> 		return pageblock_nr_pages;
> 	}
>
> 	/* We're a head */
> 	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
> 		split_free_page(start_page, buddy_order(start_page),
> 				pageblock_nr_pages);

This actually can be:

/* We're a head */
if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
        set_pageblock_migratetype(start_page, new_mt);
        split_free_page(start_page, buddy_order(start_page),
                        pageblock_nr_pages);
        return pageblock_nr_pages;
}


>
> 	/* Move buddies within the block */
> 	while (pfn <= end_pfn) {
> 		struct page *page = pfn_to_page(pfn);
> 		int order, nr_pages;
>
> 		if (!PageBuddy(page)) {
> 			pfn++;
> 			continue;
> 		}
>
> 		/* Make sure we are not inadvertently changing nodes */
> 		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
> 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
> 		order = buddy_order(page);
> 		nr_pages = 1 << order;
>
> 		move_to_free_list(page, zone, order, old_mt, new_mt);
>
> 		pfn += nr_pages;
> 		pages_moved += nr_pages;
> 	}
>
> 	set_pageblock_migratetype(start_page, new_mt);
>
> 	return pages_moved;
> }
>
> Does this look reasonable to you?

Looks good to me. Thanks.

>
> Note that the page isolation specific stuff comes first. If this code
> holds up, we should be able to move it to page-isolation.c and keep it
> out of the regular allocator path.

You mean move the tail and head part to set_migratetype_isolate()?
And change move_freepages_block() to separate prep_move_freepages_block(),
the tail and head code, and move_freepages()? It should work, and it
follows a code pattern similar to steal_suitable_fallback().


--
Best Regards,
Yan, Zi



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-13  0:06                                               ` Zi Yan
@ 2023-10-13 14:51                                                 ` Zi Yan
  2023-10-16 13:35                                                   ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-10-13 14:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel



On 12 Oct 2023, at 20:06, Zi Yan wrote:

> On 10 Oct 2023, at 17:12, Johannes Weiner wrote:
>
>> Hello!
>>
>> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>>> On 27 Sep 2023, at 22:51, Zi Yan wrote:
>>> I attached my revised patch 2 and 3 (with all the suggestions above).
>>
>> Thanks! It took me a bit to read through them. It's a tricky codebase!
>>
>> Some comments below.
>>
>>> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>>> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>>>  free lists.
>>>
>>> This avoids changing migratetype after move_freepages() or
>>> move_freepages_block(), which is error prone. It also prepares for upcoming
>>> changes to fix move_freepages() not moving free pages partially in the
>>> range.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>
>> This is great and indeed makes the callsites much simpler. Thanks,
>> I'll fold this into the series.
>>
>>> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>>  {
>>>  	struct page *page;
>>> -	unsigned long pfn;
>>> +	unsigned long pfn, pfn2;
>>>  	unsigned int order;
>>>  	int pages_moved = 0;
>>> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
>>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>>> +
>>> +	/* split at start_pfn if it is in the middle of a free page */
>>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>>> +		int new_page_order = buddy_order(new_page);
>>
>> get_freepage_start_pfn() returns start_pfn if it didn't find a large
>> buddy, so the buddy check shouldn't be necessary, right?
>>
>>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>>
>> This *should* be implied according to the comments on
>> get_freepage_start_pfn(), but it currently isn't. Doing so would help
>> here, and seemingly also in alloc_contig_range().
>>
>> How about this version of get_freepage_start_pfn()?
>>
>> /*
>>  * Scan the range before this pfn for a buddy that straddles it
>>  */
>> static unsigned long find_straddling_buddy(unsigned long start_pfn)
>> {
>> 	int order = 0;
>> 	struct page *page;
>> 	unsigned long pfn = start_pfn;
>>
>> 	while (!PageBuddy(page = pfn_to_page(pfn))) {
>> 		/* Nothing found */
>> 		if (++order > MAX_ORDER)
>> 			return start_pfn;
>> 		pfn &= ~0UL << order;
>> 	}
>>
>> 	/*
>> 	 * Found a preceding buddy, but does it straddle?
>> 	 */
>> 	if (pfn + (1 << buddy_order(page)) > start_pfn)
>> 		return pfn;
>>
>> 	/* Nothing found */
>> 	return start_pfn;
>> }
>>
>>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>
>>>  		order = buddy_order(page);
>>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>>> +		/*
>>> +		 * set page migratetype 1) only after we move all free pages in
>>> +		 * one pageblock and 2) for all pageblocks within the page.
>>> +		 *
>>> +		 * for 1), since move_to_free_list() checks page migratetype with
>>> +		 * old_mt and changing one page migratetype affects all pages
>>> +		 * within the same pageblock, if we are moving more than
>>> +		 * one free pages in the same pageblock, setting migratetype
>>> +		 * right after first move_to_free_list() triggers the warning
>>> +		 * in the following move_to_free_list().
>>> +		 *
>>> +		 * for 2), when a free page order is greater than pageblock_order,
>>> +		 * all pageblocks within the free page need to be changed after
>>> +		 * move_to_free_list().
>>
>> I think this can be somewhat simplified.
>>
>> There are two assumptions we can make. Buddies always consist of 2^n
>> pages. And buddies and pageblocks are naturally aligned. This means
>> that if this pageblock has the start of a buddy that straddles into
>> the next pageblock(s), it must be the first page in the block. That in
>> turn means we can move the handling before the loop.
>>
>> If we split first, it also makes the loop a little simpler because we
>> know that any buddies that start inside this block cannot extend
>> beyond it (due to the alignment). The loop as it was originally
>> written can remain untouched.
>>
>>> +		 */
>>> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
>>> +			for (pfn2 = pfn;
>>> +			     pfn2 < min_t(unsigned long,
>>> +					  pfn + (1 << order),
>>> +					  end_pfn + 1);
>>> +			     pfn2 += pageblock_nr_pages) {
>>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>>> +							  new_mt);
>>> +				mt_changed_pfn = pfn2;
>>
>> Hm, this seems to assume that start_pfn to end_pfn can be more than
>> one block. Why is that? This function is only used on single blocks.
>
> You are right. I made unnecessary assumptions when I wrote the code.
>
>>
>>> +			}
>>> +			/* split the free page if it goes beyond the specified range */
>>> +			if (pfn + (1 << order) > (end_pfn + 1))
>>> +				split_free_page(page, order, end_pfn + 1 - pfn);
>>> +		}
>>>  		pfn += 1 << order;
>>>  		pages_moved += 1 << order;
>>>  	}
>>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> +	/* set migratetype for the remaining pageblocks */
>>> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
>>> +	     pfn2 <= end_pfn;
>>> +	     pfn2 += pageblock_nr_pages)
>>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>>
>> If I rework the code on the above, I'm arriving at the following:
>>
>> static int move_freepages(struct zone *zone, unsigned long start_pfn,
>> 			  unsigned long end_pfn, int old_mt, int new_mt)
>> {
>> 	struct page *start_page = pfn_to_page(start_pfn);
>> 	int pages_moved = 0;
>> 	unsigned long pfn;
>>
>> 	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
>> 	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>>
>> 	/*
>> 	 * A free page may be comprised of 2^n blocks, which means our
>> 	 * block of interest could be head or tail in such a page.
>> 	 *
>> 	 * If we're a tail, update the type of our block, then split
>> 	 * the page into pageblocks. The splitting will do the leg
>> 	 * work of sorting the blocks into the right freelists.
>> 	 *
>> 	 * If we're a head, split the page into pageblocks first. This
>> 	 * ensures the migratetypes still match up during the freelist
>> 	 * removal. Then do the regular scan for buddies in the block
>> 	 * of interest, which will handle the rest.
>> 	 *
>> 	 * In theory, we could try to preserve 2^1 and larger blocks
>> 	 * that lie outside our range. In practice, MAX_ORDER is
>> 	 * usually one or two pageblocks anyway, so don't bother.
>> 	 *
>> 	 * Note that this only applies to page isolation, which calls
>> 	 * this on random blocks in the pfn range! When we move stuff
>> 	 * from inside the page allocator, the pages are coming off
>> 	 * the freelist (can't be tail) and multi-block pages are
>> 	 * handled directly in the stealing code (can't be a head).
>> 	 */
>>
>> 	/* We're a tail */
>> 	pfn = find_straddling_buddy(start_pfn);
>> 	if (pfn != start_pfn) {
>> 		struct page *free_page = pfn_to_page(pfn);
>>
>> 		set_pageblock_migratetype(start_page, new_mt);
>> 		split_free_page(free_page, buddy_order(free_page),
>> 				pageblock_nr_pages);
>> 		return pageblock_nr_pages;
>> 	}
>>
>> 	/* We're a head */
>> 	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
>> 		split_free_page(start_page, buddy_order(start_page),
>> 				pageblock_nr_pages);
>
> This actually can be:
>
> /* We're a head */
> if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
>         set_pageblock_migratetype(start_page, new_mt);
>         split_free_page(start_page, buddy_order(start_page),
>                         pageblock_nr_pages);
>         return pageblock_nr_pages;
> }
>
>
>>
>> 	/* Move buddies within the block */
>> 	while (pfn <= end_pfn) {
>> 		struct page *page = pfn_to_page(pfn);
>> 		int order, nr_pages;
>>
>> 		if (!PageBuddy(page)) {
>> 			pfn++;
>> 			continue;
>> 		}
>>
>> 		/* Make sure we are not inadvertently changing nodes */
>> 		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
>> 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>>
>> 		order = buddy_order(page);
>> 		nr_pages = 1 << order;
>>
>> 		move_to_free_list(page, zone, order, old_mt, new_mt);
>>
>> 		pfn += nr_pages;
>> 		pages_moved += nr_pages;
>> 	}
>>
>> 	set_pageblock_migratetype(start_page, new_mt);
>>
>> 	return pages_moved;
>> }
>>
>> Does this look reasonable to you?
>
> Looks good to me. Thanks.
>
>>
>> Note that the page isolation specific stuff comes first. If this code
>> holds up, we should be able to move it to page-isolation.c and keep it
>> out of the regular allocator path.
>
> You mean move the tail and head part to set_migratetype_isolate()?
> And split move_freepages_block() into separate prep_move_freepages_block(),
> the tail and head code, and move_freepages()? It should work and follows
> a code pattern similar to steal_suitable_fallback().

The attached patch has all the suggested changes, let me know how it
looks to you. Thanks.

--
Best Regards,
Yan, Zi

[-- Attachment #1.2: 0001-mm-page_isolation-split-cross-pageblock-free-pages-d.patch --]
[-- Type: text/plain, Size: 12323 bytes --]

From 32e7aefe352785b29b31b72ce0bb8b4e608860ca Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH] mm/page_isolation: split cross-pageblock free pages during
 isolation

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages partially covered by the specified range, leaving a race
window open[1]. Fix it by splitting such pages before calling
move_freepages().

Common code to find the start pfn of a free page straddling a given pfn
is factored out into find_straddling_buddy().

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page-isolation.h |  7 +++
 mm/page_alloc.c                | 94 ++++++++++++++++++++--------------
 mm/page_isolation.c            | 90 ++++++++++++++++++++------------
 3 files changed, 121 insertions(+), 70 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 901915747960..4873f1a41792 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,8 +34,15 @@ static inline bool is_migrate_isolate(int migratetype)
 #define REPORT_FAILURE	0x2
 
 void set_pageblock_migratetype(struct page *page, int migratetype);
+unsigned long find_straddling_buddy(unsigned long start_pfn);
 int move_freepages_block(struct zone *zone, struct page *page,
 			 int old_mt, int new_mt);
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
+				      unsigned long *start_pfn,
+				      unsigned long *end_pfn,
+				      int *num_free, int *num_movable);
+int move_freepages(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int old_mt, int new_mt);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 928bb595d7cc..74831a86f41d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,15 +865,15 @@ int split_free_page(struct page *free_page,
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
-	unsigned long flags;
 	int free_page_order;
 	int mt;
 	int ret = 0;
 
-	if (split_pfn_offset == 0)
-		return ret;
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+		return ret;
 
 	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
 		ret = -ENOENT;
@@ -899,7 +899,6 @@ int split_free_page(struct page *free_page,
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
 /*
@@ -1588,21 +1587,52 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+/*
+ * Scan the range before this pfn for a buddy that straddles it
+ */
+unsigned long find_straddling_buddy(unsigned long start_pfn)
+{
+	int order = 0;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	while (!PageBuddy(page = pfn_to_page(pfn))) {
+		/* Nothing found */
+		if (++order > MAX_ORDER)
+			return start_pfn;
+		pfn &= ~0UL << order;
+	}
+
+	/*
+	 * Found a preceding buddy, but does it straddle?
+	 */
+	if (pfn + (1 << buddy_order(page)) > start_pfn)
+		return pfn;
+
+	/* Nothing found */
+	return start_pfn;
+}
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
+int move_freepages(struct zone *zone, unsigned long start_pfn,
 			  unsigned long end_pfn, int old_mt, int new_mt)
 {
-	struct page *page;
-	unsigned long pfn;
-	unsigned int order;
+	struct page *start_page = pfn_to_page(start_pfn);
 	int pages_moved = 0;
+	unsigned long pfn = start_pfn;
+
+	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+	/* Move buddies within the block */
+	while (pfn <= end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+		int order, nr_pages;
 
-	for (pfn = start_pfn; pfn <= end_pfn;) {
-		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			pfn++;
 			continue;
@@ -1613,16 +1643,20 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = buddy_order(page);
+		nr_pages = 1 << order;
+
 		move_to_free_list(page, zone, order, old_mt, new_mt);
-		pfn += 1 << order;
-		pages_moved += 1 << order;
+
+		pfn += nr_pages;
+		pages_moved += nr_pages;
 	}
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+
+	set_pageblock_migratetype(start_page, new_mt);
 
 	return pages_moved;
 }
 
-static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
 				      unsigned long *start_pfn,
 				      unsigned long *end_pfn,
 				      int *num_free, int *num_movable)
@@ -6138,7 +6172,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -6212,28 +6245,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * isolated thus they won't get removed from buddy.
 	 */
 
-	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
-
-	if (outer_start != start) {
-		order = buddy_order(pfn_to_page(outer_start));
-
-		/*
-		 * outer_start page could be small order buddy page and
-		 * it doesn't include start page. Adjust outer_start
-		 * in this case to report failed page properly
-		 * on tracepoint in test_pages_isolated()
-		 */
-		if (outer_start + (1UL << order) <= start)
-			outer_start = start;
-	}
+	/*
+	 * outer_start page could be small order buddy page and it doesn't
+	 * include start page. outer_start is set to start in
+	 * find_straddling_buddy() to report failed page properly on tracepoint
+	 * in test_pages_isolated()
+	 */
+	outer_start = find_straddling_buddy(start);
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5f8c658c0853..c6a4e02ed588 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,15 +178,61 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		int nr_pages;
 		int mt = get_pageblock_migratetype(page);
+		unsigned long start_pfn, end_pfn, free_page_pfn;
+		struct page *start_page;
 
-		nr_pages = move_freepages_block(zone, page, mt, MIGRATE_ISOLATE);
 		/* Block spans zone boundaries? */
-		if (nr_pages == -1) {
+		if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
+
+		/*
+		 * A free page may be comprised of 2^n blocks, which means our
+		 * block of interest could be head or tail in such a page.
+		 *
+		 * If we're a tail, update the type of our block, then split
+		 * the page into pageblocks. The splitting will do the leg
+		 * work of sorting the blocks into the right freelists.
+		 *
+		 * If we're a head, split the page into pageblocks first. This
+		 * ensures the migratetypes still match up during the freelist
+		 * removal. Then do the regular scan for buddies in the block
+		 * of interest, which will handle the rest.
+		 *
+		 * In theory, we could try to preserve 2^1 and larger blocks
+		 * that lie outside our range. In practice, MAX_ORDER is
+		 * usually one or two pageblocks anyway, so don't bother.
+		 *
+		 * Note that this only applies to page isolation, which calls
+		 * this on random blocks in the pfn range! When we move stuff
+		 * from inside the page allocator, the pages are coming off
+		 * the freelist (can't be tail) and multi-block pages are
+		 * handled directly in the stealing code (can't be a head).
+		 */
+		start_page = pfn_to_page(start_pfn);
+
+		free_page_pfn = find_straddling_buddy(start_pfn);
+		/*
+		 * 1) We're a tail: free_page_pfn != start_pfn
+		 * 2) We're a head: free_page_pfn == start_pfn &&
+		 *		    PageBuddy(start_page) &&
+		 *		    buddy_order(start_page) > pageblock_order
+		 *
+		 * In both cases, the free page needs to be split.
+		 */
+		if (free_page_pfn != start_pfn ||
+		    (PageBuddy(start_page) &&
+		     buddy_order(start_page) > pageblock_order)) {
+			struct page *free_page = pfn_to_page(free_page_pfn);
+
+			set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
+			split_free_page(free_page, buddy_order(free_page),
+					pageblock_nr_pages);
+		} else
+			move_freepages(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE);
+
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -380,11 +426,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			VM_WARN_ONCE(pfn + (1UL << order) > boundary_pfn,
+				"a free page sits across isolation boundary");
 
 			pfn += 1UL << order;
 			continue;
@@ -408,8 +451,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			 * can be migrated. Otherwise, fail the isolation.
 			 */
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
 				int page_mt = get_pageblock_migratetype(page);
 				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
@@ -427,9 +468,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				/*
 				 * XXX: mark the page as MIGRATE_ISOLATE so that
 				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
+				 * The page should be freed into separate migratetype
+				 * free lists, unless the free page order is greater
+				 * than pageblock order. That is not the case now,
+				 * since gigantic hugetlb pages are freed as order-0
+				 * pages and LRU pages do not cross pageblocks.
 				 */
 				if (isolate_page) {
 					ret = set_migratetype_isolate(page, page_mt,
@@ -451,25 +494,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
-- 
2.42.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-13 14:51                                                 ` Zi Yan
@ 2023-10-16 13:35                                                   ` Zi Yan
  2023-10-16 14:37                                                     ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-10-16 13:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 199 bytes --]

> The attached patch has all the suggested changes, let me know how it
> looks to you. Thanks.

The one I sent has free page accounting issues. The attached one fixes them.

--
Best Regards,
Yan, Zi

[-- Attachment #1.2: v2-0001-mm-page_isolation-split-cross-pageblock-free-page.patch --]
[-- Type: text/plain, Size: 15493 bytes --]

From b428b4919e30dc0556406325d3c173a87f45f135 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH v2] mm/page_isolation: split cross-pageblock free pages during
 isolation

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages partially covered by the specified range, leaving a race
window open[1]. Fix it by splitting such pages before calling
move_freepages().

Common code to find the start pfn of a free page straddling a given pfn
is factored out into find_straddling_buddy(). split_free_page() is
modified to change the pageblock migratetypes inside the function.

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page-isolation.h |  12 +++-
 mm/internal.h                  |   3 -
 mm/page_alloc.c                | 103 ++++++++++++++++++------------
 mm/page_isolation.c            | 113 ++++++++++++++++++++++-----------
 4 files changed, 151 insertions(+), 80 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 901915747960..e82ab67867df 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,9 +33,17 @@ static inline bool is_migrate_isolate(int migratetype)
 #define MEMORY_OFFLINE	0x1
 #define REPORT_FAILURE	0x2
 
+unsigned long find_straddling_buddy(unsigned long start_pfn);
+int split_free_page(struct page *free_page,
+			unsigned int order, unsigned long split_pfn_offset,
+			int mt1, int mt2);
 void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page,
-			 int old_mt, int new_mt);
+int move_freepages(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int old_mt, int new_mt);
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
+				      unsigned long *start_pfn,
+				      unsigned long *end_pfn,
+				      int *num_free, int *num_movable);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/internal.h b/mm/internal.h
index 8c90e966e9f8..cda702359c0f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -457,9 +457,6 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
 		unsigned long, enum meminit_context, struct vmem_altmap *, int);
 
 
-int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset);
-
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 928bb595d7cc..e877fbdb700e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -851,6 +851,8 @@ static inline void __free_one_page(struct page *page,
  * @free_page:		the original free page
  * @order:		the order of the page
  * @split_pfn_offset:	split offset within the page
+ * @mt1:		migratetype set before the offset
+ * @mt2:		migratetype set after the offset
  *
  * Return -ENOENT if the free page is changed, otherwise 0
  *
@@ -860,20 +862,21 @@ static inline void __free_one_page(struct page *page,
  * nothing.
  */
 int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset)
+			unsigned int order, unsigned long split_pfn_offset,
+			int mt1, int mt2)
 {
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
-	unsigned long flags;
 	int free_page_order;
 	int mt;
 	int ret = 0;
 
-	if (split_pfn_offset == 0)
-		return ret;
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+		return ret;
 
 	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
 		ret = -ENOENT;
@@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
 	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
 	del_page_from_free_list(free_page, zone, order, mt);
 
+	set_pageblock_migratetype(free_page, mt1);
+	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
+				  mt2);
+
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
@@ -899,7 +906,6 @@ int split_free_page(struct page *free_page,
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
 /*
@@ -1588,21 +1594,52 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+/*
+ * Scan the range before this pfn for a buddy that straddles it
+ */
+unsigned long find_straddling_buddy(unsigned long start_pfn)
+{
+	int order = 0;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	while (!PageBuddy(page = pfn_to_page(pfn))) {
+		/* Nothing found */
+		if (++order > MAX_ORDER)
+			return start_pfn;
+		pfn &= ~0UL << order;
+	}
+
+	/*
+	 * Found a preceding buddy, but does it straddle?
+	 */
+	if (pfn + (1 << buddy_order(page)) > start_pfn)
+		return pfn;
+
+	/* Nothing found */
+	return start_pfn;
+}
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
+int move_freepages(struct zone *zone, unsigned long start_pfn,
 			  unsigned long end_pfn, int old_mt, int new_mt)
 {
-	struct page *page;
-	unsigned long pfn;
-	unsigned int order;
+	struct page *start_page = pfn_to_page(start_pfn);
 	int pages_moved = 0;
+	unsigned long pfn = start_pfn;
+
+	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+	/* Move buddies within the block */
+	while (pfn <= end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+		int order, nr_pages;
 
-	for (pfn = start_pfn; pfn <= end_pfn;) {
-		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			pfn++;
 			continue;
@@ -1613,16 +1650,20 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = buddy_order(page);
+		nr_pages = 1 << order;
+
 		move_to_free_list(page, zone, order, old_mt, new_mt);
-		pfn += 1 << order;
-		pages_moved += 1 << order;
+
+		pfn += nr_pages;
+		pages_moved += nr_pages;
 	}
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+
+	set_pageblock_migratetype(start_page, new_mt);
 
 	return pages_moved;
 }
 
-static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
 				      unsigned long *start_pfn,
 				      unsigned long *end_pfn,
 				      int *num_free, int *num_movable)
@@ -6138,7 +6179,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -6212,28 +6252,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * isolated thus they won't get removed from buddy.
 	 */
 
-	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
-
-	if (outer_start != start) {
-		order = buddy_order(pfn_to_page(outer_start));
-
-		/*
-		 * outer_start page could be small order buddy page and
-		 * it doesn't include start page. Adjust outer_start
-		 * in this case to report failed page properly
-		 * on tracepoint in test_pages_isolated()
-		 */
-		if (outer_start + (1UL << order) <= start)
-			outer_start = start;
-	}
+	/*
+	 * outer_start page could be small order buddy page and it doesn't
+	 * include start page. outer_start is set to start in
+	 * find_straddling_buddy() to report failed page properly on tracepoint
+	 * in test_pages_isolated()
+	 */
+	outer_start = find_straddling_buddy(start);
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5f8c658c0853..0500dff477f8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -139,6 +139,62 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 	return NULL;
 }
 
+/*
+ * additional steps for moving free pages during page isolation
+ */
+static int move_freepages_for_isolation(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int old_mt, int new_mt)
+{
+	struct page *start_page = pfn_to_page(start_pfn);
+	unsigned long pfn;
+
+	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+	/*
+	 * A free page may be comprised of 2^n blocks, which means our
+	 * block of interest could be head or tail in such a page.
+	 *
+	 * If we're a tail, update the type of our block, then split
+	 * the page into pageblocks. The splitting will do the leg
+	 * work of sorting the blocks into the right freelists.
+	 *
+	 * If we're a head, split the page into pageblocks first. This
+	 * ensures the migratetypes still match up during the freelist
+	 * removal. Then do the regular scan for buddies in the block
+	 * of interest, which will handle the rest.
+	 *
+	 * In theory, we could try to preserve 2^1 and larger blocks
+	 * that lie outside our range. In practice, MAX_ORDER is
+	 * usually one or two pageblocks anyway, so don't bother.
+	 *
+	 * Note that this only applies to page isolation, which calls
+	 * this on random blocks in the pfn range! When we move stuff
+	 * from inside the page allocator, the pages are coming off
+	 * the freelist (can't be tail) and multi-block pages are
+	 * handled directly in the stealing code (can't be a head).
+	 */
+
+	/* We're a tail */
+	pfn = find_straddling_buddy(start_pfn);
+	if (pfn != start_pfn) {
+		struct page *free_page = pfn_to_page(pfn);
+
+		split_free_page(free_page, buddy_order(free_page),
+				pageblock_nr_pages, old_mt, new_mt);
+		return pageblock_nr_pages;
+	}
+
+	/* We're a head */
+	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
+		split_free_page(start_page, buddy_order(start_page),
+				pageblock_nr_pages, new_mt, old_mt);
+		return pageblock_nr_pages;
+	}
+
+	return 0;
+}
+
 /*
  * This function set pageblock migratetype to isolate if no unmovable page is
  * present in [start_pfn, end_pfn). The pageblock must intersect with
@@ -178,15 +234,17 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		int nr_pages;
 		int mt = get_pageblock_migratetype(page);
+		unsigned long start_pfn, end_pfn;
 
-		nr_pages = move_freepages_block(zone, page, mt, MIGRATE_ISOLATE);
-		/* Block spans zone boundaries? */
-		if (nr_pages == -1) {
+		if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
+
+		if (!move_freepages_for_isolation(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE))
+			move_freepages(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE);
+
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -253,13 +311,16 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 	 * allocation.
 	 */
 	if (!isolated_page) {
-		int nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
-						    migratetype);
+		unsigned long start_pfn, end_pfn;
+
 		/*
 		 * Isolating this block already succeeded, so this
 		 * should not fail on zone boundaries.
 		 */
-		WARN_ON_ONCE(nr_pages == -1);
+		if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL))
+			WARN_ON_ONCE(1);
+		else if (!move_freepages_for_isolation(zone, start_pfn, end_pfn, MIGRATE_ISOLATE, migratetype))
+			move_freepages(zone, start_pfn, end_pfn, MIGRATE_ISOLATE, migratetype);
 	} else {
 		set_pageblock_migratetype(page, migratetype);
 		__putback_isolated_page(page, order, migratetype);
@@ -380,11 +441,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			VM_WARN_ONCE(pfn + (1UL << order) > boundary_pfn,
+				"a free page sits across isolation boundary");
 
 			pfn += 1UL << order;
 			continue;
@@ -408,8 +466,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			 * can be migrated. Otherwise, fail the isolation.
 			 */
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
 				int page_mt = get_pageblock_migratetype(page);
 				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
@@ -427,9 +483,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				/*
 				 * XXX: mark the page as MIGRATE_ISOLATE so that
 				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
+				 * The page should be freed into separate migratetype
+				 * free lists, unless the free page order is greater
+				 * than pageblock order. That is not the case now,
+				 * since gigantic hugetlb pages are freed as order-0
+				 * pages and LRU pages do not cross pageblocks.
 				 */
 				if (isolate_page) {
 					ret = set_migratetype_isolate(page, page_mt,
@@ -451,25 +509,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
-- 
2.42.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 13:35                                                   ` Zi Yan
@ 2023-10-16 14:37                                                     ` Johannes Weiner
  2023-10-16 15:00                                                       ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-10-16 14:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> > The attached patch has all the suggested changes, let me know how it
> > looks to you. Thanks.
> 
> The one I sent has free page accounting issues. The attached one fixes them.

Do you still have the warnings? I wonder what went wrong.

> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
>  	del_page_from_free_list(free_page, zone, order, mt);
>  
> +	set_pageblock_migratetype(free_page, mt1);
> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
> +				  mt2);
> +
>  	for (pfn = free_page_pfn;
>  	     pfn < free_page_pfn + (1UL << order);) {
>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

I don't think this is quite right.

With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
a buddy that spans more than two blocks:

[pageblock 0][pageblock 1][pageblock 2][pageblock 3]
[buddy                                             ]
                                       [isolate range ..

That for loop splits the buddy into 4 blocks. The code above would set
pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
set pageblock 3 to new_mt.

My proposal had the mt update in the caller:

> @@ -139,6 +139,62 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>  	return NULL;
>  }
>  
> +/*
> + * additional steps for moving free pages during page isolation
> + */
> +static int move_freepages_for_isolation(struct zone *zone, unsigned long start_pfn,
> +			  unsigned long end_pfn, int old_mt, int new_mt)
> +{
> +	struct page *start_page = pfn_to_page(start_pfn);
> +	unsigned long pfn;
> +
> +	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
> +	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
> +
> +	/*
> +	 * A free page may be comprised of 2^n blocks, which means our
> +	 * block of interest could be head or tail in such a page.
> +	 *
> +	 * If we're a tail, update the type of our block, then split
> +	 * the page into pageblocks. The splitting will do the leg
> +	 * work of sorting the blocks into the right freelists.
> +	 *
> +	 * If we're a head, split the page into pageblocks first. This
> +	 * ensures the migratetypes still match up during the freelist
> +	 * removal. Then do the regular scan for buddies in the block
> +	 * of interest, which will handle the rest.
> +	 *
> +	 * In theory, we could try to preserve 2^1 and larger blocks
> +	 * that lie outside our range. In practice, MAX_ORDER is
> +	 * usually one or two pageblocks anyway, so don't bother.
> +	 *
> +	 * Note that this only applies to page isolation, which calls
> +	 * this on random blocks in the pfn range! When we move stuff
> +	 * from inside the page allocator, the pages are coming off
> +	 * the freelist (can't be tail) and multi-block pages are
> +	 * handled directly in the stealing code (can't be a head).
> +	 */
> +
> +	/* We're a tail */
> +	pfn = find_straddling_buddy(start_pfn);
> +	if (pfn != start_pfn) {
> +		struct page *free_page = pfn_to_page(pfn);
> +
> +		split_free_page(free_page, buddy_order(free_page),
> +				pageblock_nr_pages, old_mt, new_mt);
> +		return pageblock_nr_pages;
> +	}
> +
> +	/* We're a head */
> +	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
> +		split_free_page(start_page, buddy_order(start_page),
> +				pageblock_nr_pages, new_mt, old_mt);
> +		return pageblock_nr_pages;
> +	}

i.e. here ^: set the mt of the block that's in isolation range, then
split the block.

I think I can guess the warning you were getting: in the head case, we
need to change the type of the head pageblock that's on the
freelist. If we do it before calling split, the
del_page_from_freelist() in there warns about the wrong type.

How about pulling the freelist removal out of split_free_page()?

	del_page_from_freelist(huge_buddy);
	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
	return pageblock_nr_pages;


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 14:37                                                     ` Johannes Weiner
@ 2023-10-16 15:00                                                       ` Zi Yan
  2023-10-16 18:51                                                         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-10-16 15:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6119 bytes --]

On 16 Oct 2023, at 10:37, Johannes Weiner wrote:

> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
>>> The attached patch has all the suggested changes, let me know how it
>>> looks to you. Thanks.
>>
>> The one I sent has free page accounting issues. The attached one fixes them.
>
> Do you still have the warnings? I wonder what went wrong.

No warnings. But something is off with the code:

1. in your version, split_free_page() is called without changing any pageblock
migratetypes first, so split_free_page() is effectively a no-op: the page is
just deleted from the free list, then freed again in smaller orders, and the
buddy allocator will merge them right back.

2. in my version, I set the pageblock migratetype to new_mt before split_free_page(),
but that causes free page accounting issues: in the head case, free pages are
deleted under new_mt while they still sit on the old_mt free list, so the
accounting decreases the new_mt free page count instead of the old_mt one.

Basically, split_free_page() is awkward: it relies on preset migratetypes, i.e.
the caller changes the migratetypes without deleting the free pages from their
list first. That is why I came up with the new split_free_page() below.

>
>> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
>>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
>>  	del_page_from_free_list(free_page, zone, order, mt);
>>
>> +	set_pageblock_migratetype(free_page, mt1);
>> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
>> +				  mt2);
>> +
>>  	for (pfn = free_page_pfn;
>>  	     pfn < free_page_pfn + (1UL << order);) {
>>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
>
> I don't think this is quite right.
>
> With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
> a buddy that spans more than two blocks:
>
> [pageblock 0][pageblock 1][pageblock 2][pageblock 3]
> [buddy                                             ]
>                                        [isolate range ..
>
> That for loop splits the buddy into 4 blocks. The code above would set
> pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
> set pageblock 3 to new_mt.

OK. I think I need to fix split_free_page().

Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy span more than one
pageblock, and in turn make an in-use page span more than one pageblock,
we will have problems: in isolate_single_pageblock(), an in-use page
can have part of its pageblocks set to a different migratetype and then
be freed, leaving a free page with unmatched migratetypes. We might need
to free such pages at pageblock_order if their orders are bigger than
pageblock_order.

Which arch with CONFIG_ARCH_FORCE_MAX_ORDER can have a buddy containing
more than one pageblock? I would like to run some tests.

>
> My proposal had the mt update in the caller:
>
>> @@ -139,6 +139,62 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>>  	return NULL;
>>  }
>>
>> +/*
>> + * additional steps for moving free pages during page isolation
>> + */
>> +static int move_freepages_for_isolation(struct zone *zone, unsigned long start_pfn,
>> +			  unsigned long end_pfn, int old_mt, int new_mt)
>> +{
>> +	struct page *start_page = pfn_to_page(start_pfn);
>> +	unsigned long pfn;
>> +
>> +	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
>> +	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>> +
>> +	/*
>> +	 * A free page may be comprised of 2^n blocks, which means our
>> +	 * block of interest could be head or tail in such a page.
>> +	 *
>> +	 * If we're a tail, update the type of our block, then split
>> +	 * the page into pageblocks. The splitting will do the leg
>> +	 * work of sorting the blocks into the right freelists.
>> +	 *
>> +	 * If we're a head, split the page into pageblocks first. This
>> +	 * ensures the migratetypes still match up during the freelist
>> +	 * removal. Then do the regular scan for buddies in the block
>> +	 * of interest, which will handle the rest.
>> +	 *
>> +	 * In theory, we could try to preserve 2^1 and larger blocks
>> +	 * that lie outside our range. In practice, MAX_ORDER is
>> +	 * usually one or two pageblocks anyway, so don't bother.
>> +	 *
>> +	 * Note that this only applies to page isolation, which calls
>> +	 * this on random blocks in the pfn range! When we move stuff
>> +	 * from inside the page allocator, the pages are coming off
>> +	 * the freelist (can't be tail) and multi-block pages are
>> +	 * handled directly in the stealing code (can't be a head).
>> +	 */
>> +
>> +	/* We're a tail */
>> +	pfn = find_straddling_buddy(start_pfn);
>> +	if (pfn != start_pfn) {
>> +		struct page *free_page = pfn_to_page(pfn);
>> +
>> +		split_free_page(free_page, buddy_order(free_page),
>> +				pageblock_nr_pages, old_mt, new_mt);
>> +		return pageblock_nr_pages;
>> +	}
>> +
>> +	/* We're a head */
>> +	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
>> +		split_free_page(start_page, buddy_order(start_page),
>> +				pageblock_nr_pages, new_mt, old_mt);
>> +		return pageblock_nr_pages;
>> +	}
>
> i.e. here ^: set the mt of the block that's in isolation range, then
> split the block.
>
> I think I can guess the warning you were getting: in the head case, we
> need to change the type of the head pageblock that's on the
> freelist. If we do it before calling split, the
> del_page_from_freelist() in there warns about the wrong type.
>
> How about pulling the freelist removal out of split_free_page()?
>
> 	del_page_from_freelist(huge_buddy);
> 	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
> 	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
> 	return pageblock_nr_pages;

Yes, this is better. Let me change to this implementation.

But I would like to test it in an environment where a buddy contains more than
one pageblock first. I can probably change MAX_ORDER on x86_64 to do that locally.
I will report back.

--
Best Regards,
Yan, Zi



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 15:00                                                       ` Zi Yan
@ 2023-10-16 18:51                                                         ` Johannes Weiner
  2023-10-16 19:49                                                           ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-10-16 18:51 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
> 
> > On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> >>> The attached patch has all the suggested changes, let me know how it
> >>> looks to you. Thanks.
> >>
> >> The one I sent has free page accounting issues. The attached one fixes them.
> >
> > Do you still have the warnings? I wonder what went wrong.
> 
> No warnings. But something is off with the code:
> 
> 1. in your version, split_free_page() is called without changing any pageblock
> migratetypes first, so split_free_page() is effectively a no-op: the page is
> just deleted from the free list, then freed again in smaller orders, and the
> buddy allocator will merge them right back.

Hm not quite.

If it's the tail block of a buddy, I update its type before
splitting. The splitting loop looks up the type of each block for
sorting it onto freelists.

If it's the head block, yes I split it first according to its old
type. But then I let it fall through to scanning the block, which will
find that buddy, update its type and move it.

> 2. in my version, I set the pageblock migratetype to new_mt before split_free_page(),
> but that causes free page accounting issues: in the head case, free pages are
> deleted under new_mt while they still sit on the old_mt free list, so the
> accounting decreases the new_mt free page count instead of the old_mt one.

Right, that makes sense.

> Basically, split_free_page() is awkward: it relies on preset migratetypes, i.e.
> the caller changes the migratetypes without deleting the free pages from their
> list first. That is why I came up with the new split_free_page() below.

Yeah, the in-between thing is bad. Either it fixes the migratetype
before deletion, or it doesn't do the deletion. I'm thinking it would
be simpler to move the deletion out instead.

> >> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
> >>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
> >>  	del_page_from_free_list(free_page, zone, order, mt);
> >>
> >> +	set_pageblock_migratetype(free_page, mt1);
> >> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
> >> +				  mt2);
> >> +
> >>  	for (pfn = free_page_pfn;
> >>  	     pfn < free_page_pfn + (1UL << order);) {
> >>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> >
> > I don't think this is quite right.
> >
> > With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
> > a buddy that spans more than two blocks:
> >
> > [pageblock 0][pageblock 1][pageblock 2][pageblock 3]
> > [buddy                                             ]
> >                                        [isolate range ..
> >
> > That for loop splits the buddy into 4 blocks. The code above would set
> > pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
> > set pageblock 3 to new_mt.
> 
> OK. I think I need to fix split_free_page().
> 
> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
> pageblock and in turn makes an in-use page have more than one pageblock,
> we will have problems. Since in isolate_single_pageblock(), an in-use page
> can have part of its pageblock set to a different migratetype and be freed,
> causing the free page with unmatched migratetypes. We might need to
> free pages at pageblock_order if their orders are bigger than pageblock_order.

Is this a practical issue? You mentioned that right now only gigantic
pages can be larger than a pageblock, and those are freed in order-0
chunks.

> > How about pulling the freelist removal out of split_free_page()?
> >
> > 	del_page_from_freelist(huge_buddy);
> > 	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
> > 	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
> > 	return pageblock_nr_pages;
> 
> Yes, this is better. Let me change to this implementation.
> 
> But I would like to test it in an environment where a buddy contains more than
> one pageblock first. I can probably change MAX_ORDER on x86_64 to do that locally.
> I will report back.

I tweaked my version some more based on our discussion. Would you mind
taking a look? It survived an hour of stressing with a kernel build
and Mike's reproducer that allocates gigantic pages and demotes them.

Note that it applies *before* the consolidation of the free counts, as
isolation needs to be fixed before the warnings are added, to avoid
bisectability issues. The consolidation patch doesn't change it much,
except removing freepage accounting in move_freepages_block_isolate().

---

From a0460ad30a24cf73816ac40b262af0ba3723a242 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 16 Oct 2023 12:32:21 -0400
Subject: [PATCH] mm: page_isolation: prepare for hygienic freelists

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page-isolation.h |   4 +-
 mm/internal.h                  |   4 -
 mm/page_alloc.c                | 198 +++++++++++++++++++--------------
 mm/page_isolation.c            |  96 +++++-----------
 4 files changed, 142 insertions(+), 160 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 8550b3c91480..c16db0067090 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype)
 #define REPORT_FAILURE	0x2
 
 void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
+
+bool move_freepages_block_isolate(struct zone *zone, struct page *page,
+				  int migratetype);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/internal.h b/mm/internal.h
index 3a72975425bb..0681094ad260 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -464,10 +464,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
 void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
 		unsigned long, enum meminit_context, struct vmem_altmap *, int);
 
-
-int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset);
-
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6185b076cf90..17e9a06027c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -834,64 +834,6 @@ static inline void __free_one_page(struct page *page,
 		page_reporting_notify_free(order);
 }
 
-/**
- * split_free_page() -- split a free page at split_pfn_offset
- * @free_page:		the original free page
- * @order:		the order of the page
- * @split_pfn_offset:	split offset within the page
- *
- * Return -ENOENT if the free page is changed, otherwise 0
- *
- * It is used when the free page crosses two pageblocks with different migratetypes
- * at split_pfn_offset within the page. The split free page will be put into
- * separate migratetype lists afterwards. Otherwise, the function achieves
- * nothing.
- */
-int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset)
-{
-	struct zone *zone = page_zone(free_page);
-	unsigned long free_page_pfn = page_to_pfn(free_page);
-	unsigned long pfn;
-	unsigned long flags;
-	int free_page_order;
-	int mt;
-	int ret = 0;
-
-	if (split_pfn_offset == 0)
-		return ret;
-
-	spin_lock_irqsave(&zone->lock, flags);
-
-	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
-		ret = -ENOENT;
-		goto out;
-	}
-
-	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
-	if (likely(!is_migrate_isolate(mt)))
-		__mod_zone_freepage_state(zone, -(1UL << order), mt);
-
-	del_page_from_free_list(free_page, zone, order);
-	for (pfn = free_page_pfn;
-	     pfn < free_page_pfn + (1UL << order);) {
-		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
-
-		free_page_order = min_t(unsigned int,
-					pfn ? __ffs(pfn) : order,
-					__fls(split_pfn_offset));
-		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
-				mt, FPI_NONE);
-		pfn += 1UL << free_page_order;
-		split_pfn_offset -= (1UL << free_page_order);
-		/* we have done the first part, now switch to second part */
-		if (split_pfn_offset == 0)
-			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
-	}
-out:
-	spin_unlock_irqrestore(&zone->lock, flags);
-	return ret;
-}
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -1673,8 +1615,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 	return true;
 }
 
-int move_freepages_block(struct zone *zone, struct page *page,
-			 int migratetype)
+static int move_freepages_block(struct zone *zone, struct page *page,
+				int migratetype)
 {
 	unsigned long start_pfn, end_pfn;
 
@@ -1685,6 +1627,117 @@ int move_freepages_block(struct zone *zone, struct page *page,
 	return move_freepages(zone, start_pfn, end_pfn, migratetype);
 }
 
+#ifdef CONFIG_MEMORY_ISOLATION
+/* Look for a multi-block buddy that straddles start_pfn */
+static unsigned long find_large_buddy(unsigned long start_pfn)
+{
+	int order = 0;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	while (!PageBuddy(page = pfn_to_page(pfn))) {
+		/* Nothing found */
+		if (++order > MAX_ORDER)
+			return start_pfn;
+		pfn &= ~0UL << order;
+	}
+
+	/*
+	 * Found a preceding buddy, but does it straddle?
+	 */
+	if (pfn + (1 << buddy_order(page)) > start_pfn)
+		return pfn;
+
+	/* Nothing found */
+	return start_pfn;
+}
+
+/* Split a multi-block buddy into its individual pageblocks */
+static void split_large_buddy(struct page *buddy, int order)
+{
+	unsigned long pfn = page_to_pfn(buddy);
+	unsigned long end = pfn + (1 << order);
+	struct zone *zone = page_zone(buddy);
+
+	lockdep_assert_held(&zone->lock);
+	VM_WARN_ON_ONCE(PageBuddy(buddy));
+
+	while (pfn < end) {
+		int mt = get_pfnblock_migratetype(buddy, pfn);
+
+		__free_one_page(buddy, pfn, zone, pageblock_order, mt, FPI_NONE);
+		pfn += pageblock_nr_pages;
+		buddy = pfn_to_page(pfn);
+	}
+}
+
+/**
+ * move_freepages_block_isolate - move free pages in block for page isolation
+ * @zone: the zone
+ * @page: the pageblock page
+ * @migratetype: migratetype to set on the pageblock
+ *
+ * This is similar to move_freepages_block(), but handles the special
+ * case encountered in page isolation, where the block of interest
+ * might be part of a larger buddy spanning multiple pageblocks.
+ *
+ * Unlike the regular page allocator path, which moves pages while
+ * stealing buddies off the freelist, page isolation is interested in
+ * arbitrary pfn ranges that may have overlapping buddies on both ends.
+ *
+ * This function handles that. Straddling buddies are split into
+ * individual pageblocks. Only the block of interest is moved.
+ *
+ * Returns %true if pages could be moved, %false otherwise.
+ */
+bool move_freepages_block_isolate(struct zone *zone, struct page *page,
+				  int migratetype)
+{
+	unsigned long start_pfn, end_pfn, pfn;
+	int nr_moved, mt;
+
+	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+				       NULL, NULL))
+		return false;
+
+	/* We're a tail block in a larger buddy */
+	pfn = find_large_buddy(start_pfn);
+	if (pfn != start_pfn) {
+		struct page *buddy = pfn_to_page(pfn);
+		int order = buddy_order(buddy);
+		int mt = get_pfnblock_migratetype(buddy, pfn);
+
+		if (!is_migrate_isolate(mt))
+			__mod_zone_freepage_state(zone, -(1UL << order), mt);
+		del_page_from_free_list(buddy, zone, order);
+		set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype);
+		split_large_buddy(buddy, order);
+		return true;
+	}
+
+	/* We're the starting block of a larger buddy */
+	if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
+		int mt = get_pfnblock_migratetype(page, pfn);
+		int order = buddy_order(page);
+
+		if (!is_migrate_isolate(mt))
+			__mod_zone_freepage_state(zone, -(1UL << order), mt);
+		del_page_from_free_list(page, zone, order);
+		set_pageblock_migratetype(page, migratetype);
+		split_large_buddy(page, order);
+		return true;
+	}
+
+	mt = get_pfnblock_migratetype(page, start_pfn);
+	nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
+	if (!is_migrate_isolate(mt))
+		__mod_zone_freepage_state(zone, -nr_moved, mt);
+	else if (!is_migrate_isolate(migratetype))
+		__mod_zone_freepage_state(zone, nr_moved, migratetype);
+	return true;
+}
+#endif /* CONFIG_MEMORY_ISOLATION */
+
 static void change_pageblock_range(struct page *pageblock_page,
 					int start_order, int migratetype)
 {
@@ -6318,7 +6371,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -6391,29 +6443,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * We don't have to hold zone->lock here because the pages are
 	 * isolated thus they won't get removed from buddy.
 	 */
-
-	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
-
-	if (outer_start != start) {
-		order = buddy_order(pfn_to_page(outer_start));
-
-		/*
-		 * outer_start page could be small order buddy page and
-		 * it doesn't include start page. Adjust outer_start
-		 * in this case to report failed page properly
-		 * on tracepoint in test_pages_isolated()
-		 */
-		if (outer_start + (1UL << order) <= start)
-			outer_start = start;
-	}
+	outer_start = find_large_buddy(start);
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 27ee994a57d3..b4d53545496d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,16 +178,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		int nr_pages;
-		int mt = get_pageblock_migratetype(page);
-
-		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
-		/* Block spans zone boundaries? */
-		if (nr_pages == -1) {
+		if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
-		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -254,13 +248,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 	 * allocation.
 	 */
 	if (!isolated_page) {
-		int nr_pages = move_freepages_block(zone, page, migratetype);
 		/*
 		 * Isolating this block already succeeded, so this
 		 * should not fail on zone boundaries.
 		 */
-		WARN_ON_ONCE(nr_pages == -1);
-		__mod_zone_freepage_state(zone, nr_pages, migratetype);
+		WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
 	} else {
 		set_pageblock_migratetype(page, migratetype);
 		__putback_isolated_page(page, order, migratetype);
@@ -373,26 +365,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 		VM_BUG_ON(!page);
 		pfn = page_to_pfn(page);
-		/*
-		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
-		 * free pages in [start_pfn, boundary_pfn), its head page will
-		 * always be in the range.
-		 */
+
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			/* move_freepages_block_isolate() handled this */
+			VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn);
 
 			pfn += 1UL << order;
 			continue;
 		}
+
 		/*
-		 * migrate compound pages then let the free page handling code
-		 * above do the rest. If migration is not possible, just fail.
+		 * If a compound page is straddling our block, attempt
+		 * to migrate it out of the way.
+		 *
+		 * We don't have to worry about this creating a large
+		 * free page that straddles into our block: gigantic
+		 * pages are freed as order-0 chunks, and LRU pages
+		 * (currently) do not exceed pageblock_order.
+		 *
+		 * The block of interest has already been marked
+		 * MIGRATE_ISOLATE above, so when migration is done it
+		 * will free its pages onto the correct freelists.
 		 */
 		if (PageCompound(page)) {
 			struct page *head = compound_head(page);
@@ -403,16 +398,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				pfn = head_pfn + nr_pages;
 				continue;
 			}
+
+			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 			/*
-			 * hugetlb, lru compound (THP), and movable compound pages
-			 * can be migrated. Otherwise, fail the isolation.
+			 * hugetlb, and movable compound pages can be
+			 * migrated. Otherwise, fail the isolation.
 			 */
-			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
-				int page_mt = get_pageblock_migratetype(page);
-				bool isolate_page = !is_migrate_isolate_page(page);
+			if (PageHuge(page) || __PageMovable(page)) {
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -425,52 +419,12 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				};
 				INIT_LIST_HEAD(&cc.migratepages);
 
-				/*
-				 * XXX: mark the page as MIGRATE_ISOLATE so that
-				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
-				 */
-				if (isolate_page) {
-					ret = set_migratetype_isolate(page, page_mt,
-						flags, head_pfn, head_pfn + nr_pages);
-					if (ret)
-						goto failed;
-				}
-
 				ret = __alloc_contig_migrate_range(&cc, head_pfn,
 							head_pfn + nr_pages);
-
-				/*
-				 * restore the page's migratetype so that it can
-				 * be split into separate migratetype free lists
-				 * later.
-				 */
-				if (isolate_page)
-					unset_migratetype_isolate(page, page_mt);
-
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
-- 
2.42.0



* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 18:51                                                         ` Johannes Weiner
@ 2023-10-16 19:49                                                           ` Zi Yan
  2023-10-16 20:26                                                             ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Zi Yan @ 2023-10-16 19:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 21711 bytes --]

On 16 Oct 2023, at 14:51, Johannes Weiner wrote:

> On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
>> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
>>
>>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
>>>>> The attached patch has all the suggested changes, let me know how it
>>>>> looks to you. Thanks.
>>>>
>>>> The one I sent has free page accounting issues. The attached one fixes them.
>>>
>>> Do you still have the warnings? I wonder what went wrong.
>>
>> No warnings. But something with the code:
>>
>> 1. in your version, split_free_page() is called without changing any pageblock
>> migratetypes, then split_free_page() is just a no-op, since the page is
>> just deleted from the free list, then freed via different orders. Buddy allocator
>> will merge them back.
>
> Hm not quite.
>
> If it's the tail block of a buddy, I update its type before
> splitting. The splitting loop looks up the type of each block for
> sorting it onto freelists.
>
> If it's the head block, yes I split it first according to its old
> type. But then I let it fall through to scanning the block, which will
> find that buddy, update its type and move it.

That is the issue: split_free_page() assumes the pageblocks of that
free page have different types. It basically just frees the page as
smaller-order chunks that sum up to the original free page order. If
all pageblocks of the free page have the same migratetype,
__free_one_page() will merge these small-order pages back into the
original-order free page.

>
>> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
>> but it causes free page accounting issues, since in the case of head, free pages
>> are deleted from new_mt when they are in old_mt free list and the accounting
>> decreases new_mt free page number instead of old_mt one.
>
> Right, that makes sense.
>
>> Basically, split_free_page() is awkward as it relies on preset migratetypes,
>> which changes migratetypes without deleting the free pages from the list first.
>> That is why I came up with the new split_free_page() below.
>
> Yeah, the in-between thing is bad. Either it fixes the migratetype
> before deletion, or it doesn't do the deletion. I'm thinking it would
> be simpler to move the deletion out instead.

Yes and no. After deletion, a free page no longer has PageBuddy set
and its buddy_order information is cleared. Either we restore PageBuddy
and the order on the deleted free page, or split_free_page() needs to
be changed to accept pages without that information (basically, remove
the PageBuddy and order checks).

>>>> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
>>>>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
>>>>  	del_page_from_free_list(free_page, zone, order, mt);
>>>>
>>>> +	set_pageblock_migratetype(free_page, mt1);
>>>> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
>>>> +				  mt2);
>>>> +
>>>>  	for (pfn = free_page_pfn;
>>>>  	     pfn < free_page_pfn + (1UL << order);) {
>>>>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
>>>
>>> I don't think this is quite right.
>>>
>>> With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
>>> a buddy that is more than two blocks:
>>>
>>> [pageblock 0][pageblock 1][pageblock 2][pageblock 3]
>>> [buddy                                             ]
>>>                                        [isolate range ..
>>>
>>> That for loop splits the buddy into 4 blocks. The code above would set
>>> pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
>>> set pageblock 3 to new_mt.
>>
>> OK. I think I need to fix split_free_page().
>>
>> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
>> pageblock and in turn makes an in-use page have more than one pageblock,
>> we will have problems. Since in isolate_single_pageblock(), an in-use page
>> can have part of its pageblock set to a different migratetype and be freed,
>> causing the free page with unmatched migratetypes. We might need to
>> free pages at pageblock_order if their orders are bigger than pageblock_order.
>
> Is this a practical issue? You mentioned that right now only gigantic
> pages can be larger than a pageblock, and those are freed in order-0
> chunks.

Only if the system allocates a non-hugetlb page with order
>pageblock_order and frees it at the same order. I just do not know
whether such pages exist on architectures other than x86. Maybe I am
overthinking it.

>
>>> How about pulling the freelist removal out of split_free_page()?
>>>
>>> 	del_page_from_freelist(huge_buddy);
>>> 	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
>>> 	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
>>> 	return pageblock_nr_pages;
>>
>> Yes, this is better. Let me change to this implementation.
>>
>> But I would like to test it on an environment where a buddy contains more than
>> one pageblocks first. I probably can change MAX_ORDER of x86_64 to do it locally.
>> I will report back.
>
> I tweaked my version some more based on our discussion. Would you mind
> taking a look? It survived an hour of stressing with a kernel build
> and Mike's reproducer that allocates gigantics and demotes them.
>
> Note that it applies *before* consolidating of the free counts, as
> isolation needs to be fixed before the warnings are added, to avoid
> bisectability issues. The consolidation patch doesn't change it much,
> except removing freepage accounting in move_freepages_block_isolate().
>
> ---
>
> From a0460ad30a24cf73816ac40b262af0ba3723a242 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 16 Oct 2023 12:32:21 -0400
> Subject: [PATCH] mm: page_isolation: prepare for hygienic freelists
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/page-isolation.h |   4 +-
>  mm/internal.h                  |   4 -
>  mm/page_alloc.c                | 198 +++++++++++++++++++--------------
>  mm/page_isolation.c            |  96 +++++-----------
>  4 files changed, 142 insertions(+), 160 deletions(-)
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 8550b3c91480..c16db0067090 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype)
>  #define REPORT_FAILURE	0x2
>
>  void set_pageblock_migratetype(struct page *page, int migratetype);
> -int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
> +
> +bool move_freepages_block_isolate(struct zone *zone, struct page *page,
> +				  int migratetype);
>
>  int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  			     int migratetype, int flags, gfp_t gfp_flags);
> diff --git a/mm/internal.h b/mm/internal.h
> index 3a72975425bb..0681094ad260 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -464,10 +464,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>  void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
>  		unsigned long, enum meminit_context, struct vmem_altmap *, int);
>
> -
> -int split_free_page(struct page *free_page,
> -			unsigned int order, unsigned long split_pfn_offset);
> -
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6185b076cf90..17e9a06027c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -834,64 +834,6 @@ static inline void __free_one_page(struct page *page,
>  		page_reporting_notify_free(order);
>  }
>
> -/**
> - * split_free_page() -- split a free page at split_pfn_offset
> - * @free_page:		the original free page
> - * @order:		the order of the page
> - * @split_pfn_offset:	split offset within the page
> - *
> - * Return -ENOENT if the free page is changed, otherwise 0
> - *
> - * It is used when the free page crosses two pageblocks with different migratetypes
> - * at split_pfn_offset within the page. The split free page will be put into
> - * separate migratetype lists afterwards. Otherwise, the function achieves
> - * nothing.
> - */
> -int split_free_page(struct page *free_page,
> -			unsigned int order, unsigned long split_pfn_offset)
> -{
> -	struct zone *zone = page_zone(free_page);
> -	unsigned long free_page_pfn = page_to_pfn(free_page);
> -	unsigned long pfn;
> -	unsigned long flags;
> -	int free_page_order;
> -	int mt;
> -	int ret = 0;
> -
> -	if (split_pfn_offset == 0)
> -		return ret;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -
> -	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
> -		ret = -ENOENT;
> -		goto out;
> -	}
> -
> -	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
> -	if (likely(!is_migrate_isolate(mt)))
> -		__mod_zone_freepage_state(zone, -(1UL << order), mt);
> -
> -	del_page_from_free_list(free_page, zone, order);
> -	for (pfn = free_page_pfn;
> -	     pfn < free_page_pfn + (1UL << order);) {
> -		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> -
> -		free_page_order = min_t(unsigned int,
> -					pfn ? __ffs(pfn) : order,
> -					__fls(split_pfn_offset));
> -		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
> -				mt, FPI_NONE);
> -		pfn += 1UL << free_page_order;
> -		split_pfn_offset -= (1UL << free_page_order);
> -		/* we have done the first part, now switch to second part */
> -		if (split_pfn_offset == 0)
> -			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
> -	}
> -out:
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -	return ret;
> -}
>  /*
>   * A bad page could be due to a number of fields. Instead of multiple branches,
>   * try and check multiple fields with one check. The caller must do a detailed
> @@ -1673,8 +1615,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>  	return true;
>  }
>
> -int move_freepages_block(struct zone *zone, struct page *page,
> -			 int migratetype)
> +static int move_freepages_block(struct zone *zone, struct page *page,
> +				int migratetype)
>  {
>  	unsigned long start_pfn, end_pfn;
>
> @@ -1685,6 +1627,117 @@ int move_freepages_block(struct zone *zone, struct page *page,
>  	return move_freepages(zone, start_pfn, end_pfn, migratetype);
>  }
>
> +#ifdef CONFIG_MEMORY_ISOLATION
> +/* Look for a multi-block buddy that straddles start_pfn */
> +static unsigned long find_large_buddy(unsigned long start_pfn)
> +{
> +	int order = 0;
> +	struct page *page;
> +	unsigned long pfn = start_pfn;
> +
> +	while (!PageBuddy(page = pfn_to_page(pfn))) {
> +		/* Nothing found */
> +		if (++order > MAX_ORDER)
> +			return start_pfn;
> +		pfn &= ~0UL << order;
> +	}
> +
> +	/*
> +	 * Found a preceding buddy, but does it straddle?
> +	 */
> +	if (pfn + (1 << buddy_order(page)) > start_pfn)
> +		return pfn;
> +
> +	/* Nothing found */
> +	return start_pfn;
> +}
> +
> +/* Split a multi-block buddy into its individual pageblocks */
> +static void split_large_buddy(struct page *buddy, int order)
> +{
> +	unsigned long pfn = page_to_pfn(buddy);
> +	unsigned long end = pfn + (1 << order);
> +	struct zone *zone = page_zone(buddy);
> +
> +	lockdep_assert_held(&zone->lock);
> +	VM_WARN_ON_ONCE(PageBuddy(buddy));
> +
> +	while (pfn < end) {
> +		int mt = get_pfnblock_migratetype(buddy, pfn);
> +
> +		__free_one_page(buddy, pfn, zone, pageblock_order, mt, FPI_NONE);
> +		pfn += pageblock_nr_pages;
> +		buddy = pfn_to_page(pfn);
> +	}
> +}
> +
> +/**
> + * move_freepages_block_isolate - move free pages in block for page isolation
> + * @zone: the zone
> + * @page: the pageblock page
> + * @migratetype: migratetype to set on the pageblock
> + *
> + * This is similar to move_freepages_block(), but handles the special
> + * case encountered in page isolation, where the block of interest
> + * might be part of a larger buddy spanning multiple pageblocks.
> + *
> + * Unlike the regular page allocator path, which moves pages while
> + * stealing buddies off the freelist, page isolation is interested in
> + * arbitrary pfn ranges that may have overlapping buddies on both ends.
> + *
> + * This function handles that. Straddling buddies are split into
> + * individual pageblocks. Only the block of interest is moved.
> + *
> + * Returns %true if pages could be moved, %false otherwise.
> + */
> +bool move_freepages_block_isolate(struct zone *zone, struct page *page,
> +				  int migratetype)
> +{
> +	unsigned long start_pfn, end_pfn, pfn;
> +	int nr_moved, mt;
> +
> +	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
> +				       NULL, NULL))
> +		return false;
> +
> +	/* We're a tail block in a larger buddy */
> +	pfn = find_large_buddy(start_pfn);
> +	if (pfn != start_pfn) {
> +		struct page *buddy = pfn_to_page(pfn);
> +		int order = buddy_order(buddy);
> +		int mt = get_pfnblock_migratetype(buddy, pfn);
> +
> +		if (!is_migrate_isolate(mt))
> +			__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +		del_page_from_free_list(buddy, zone, order);
> +		set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype);
> +		split_large_buddy(buddy, order);
> +		return true;
> +	}
> +
> +	/* We're the starting block of a larger buddy */
> +	if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
> +		int mt = get_pfnblock_migratetype(page, pfn);
> +		int order = buddy_order(page);
> +
> +		if (!is_migrate_isolate(mt))
> +			__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +		del_page_from_free_list(page, zone, order);
> +		set_pageblock_migratetype(page, migratetype);
> +		split_large_buddy(page, order);
> +		return true;
> +	}
> +
> +	mt = get_pfnblock_migratetype(page, start_pfn);
> +	nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
> +	if (!is_migrate_isolate(mt))
> +		__mod_zone_freepage_state(zone, -nr_moved, mt);
> +	else if (!is_migrate_isolate(migratetype))
> +		__mod_zone_freepage_state(zone, nr_moved, migratetype);
> +	return true;
> +}
> +#endif /* CONFIG_MEMORY_ISOLATION */
> +
>  static void change_pageblock_range(struct page *pageblock_page,
>  					int start_order, int migratetype)
>  {
> @@ -6318,7 +6371,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		       unsigned migratetype, gfp_t gfp_mask)
>  {
>  	unsigned long outer_start, outer_end;
> -	int order;
>  	int ret = 0;
>
>  	struct compact_control cc = {
> @@ -6391,29 +6443,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * We don't have to hold zone->lock here because the pages are
>  	 * isolated thus they won't get removed from buddy.
>  	 */
> -
> -	order = 0;
> -	outer_start = start;
> -	while (!PageBuddy(pfn_to_page(outer_start))) {
> -		if (++order > MAX_ORDER) {
> -			outer_start = start;
> -			break;
> -		}
> -		outer_start &= ~0UL << order;
> -	}
> -
> -	if (outer_start != start) {
> -		order = buddy_order(pfn_to_page(outer_start));
> -
> -		/*
> -		 * outer_start page could be small order buddy page and
> -		 * it doesn't include start page. Adjust outer_start
> -		 * in this case to report failed page properly
> -		 * on tracepoint in test_pages_isolated()
> -		 */
> -		if (outer_start + (1UL << order) <= start)
> -			outer_start = start;
> -	}
> +	outer_start = find_large_buddy(start);
>
>  	/* Make sure the range is really isolated. */
>  	if (test_pages_isolated(outer_start, end, 0)) {
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 27ee994a57d3..b4d53545496d 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -178,16 +178,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
>  			migratetype, isol_flags);
>  	if (!unmovable) {
> -		int nr_pages;
> -		int mt = get_pageblock_migratetype(page);
> -
> -		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> -		/* Block spans zone boundaries? */
> -		if (nr_pages == -1) {
> +		if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) {
>  			spin_unlock_irqrestore(&zone->lock, flags);
>  			return -EBUSY;
>  		}
> -		__mod_zone_freepage_state(zone, -nr_pages, mt);
>  		zone->nr_isolate_pageblock++;
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  		return 0;
> @@ -254,13 +248,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
>  	 * allocation.
>  	 */
>  	if (!isolated_page) {
> -		int nr_pages = move_freepages_block(zone, page, migratetype);
>  		/*
>  		 * Isolating this block already succeeded, so this
>  		 * should not fail on zone boundaries.
>  		 */
> -		WARN_ON_ONCE(nr_pages == -1);
> -		__mod_zone_freepage_state(zone, nr_pages, migratetype);
> +		WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
>  	} else {
>  		set_pageblock_migratetype(page, migratetype);
>  		__putback_isolated_page(page, order, migratetype);
> @@ -373,26 +365,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>
>  		VM_BUG_ON(!page);
>  		pfn = page_to_pfn(page);
> -		/*
> -		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
> -		 * free pages in [start_pfn, boundary_pfn), its head page will
> -		 * always be in the range.
> -		 */
> +
>  		if (PageBuddy(page)) {
>  			int order = buddy_order(page);
>
> -			if (pfn + (1UL << order) > boundary_pfn) {
> -				/* free page changed before split, check it again */
> -				if (split_free_page(page, order, boundary_pfn - pfn))
> -					continue;
> -			}
> +			/* move_freepages_block_isolate() handled this */
> +			VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn);
>
>  			pfn += 1UL << order;
>  			continue;
>  		}
> +
>  		/*
> -		 * migrate compound pages then let the free page handling code
> -		 * above do the rest. If migration is not possible, just fail.
> +		 * If a compound page is straddling our block, attempt
> +		 * to migrate it out of the way.
> +		 *
> +		 * We don't have to worry about this creating a large
> +		 * free page that straddles into our block: gigantic
> +		 * pages are freed as order-0 chunks, and LRU pages
> +		 * (currently) do not exceed pageblock_order.
> +		 *
> +		 * The block of interest has already been marked
> +		 * MIGRATE_ISOLATE above, so when migration is done it
> +		 * will free its pages onto the correct freelists.
>  		 */
>  		if (PageCompound(page)) {
>  			struct page *head = compound_head(page);
> @@ -403,16 +398,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				pfn = head_pfn + nr_pages;
>  				continue;
>  			}
> +
> +			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
> +
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  			/*
> -			 * hugetlb, lru compound (THP), and movable compound pages
> -			 * can be migrated. Otherwise, fail the isolation.
> +			 * hugetlb, and movable compound pages can be
> +			 * migrated. Otherwise, fail the isolation.
>  			 */
> -			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
> -				int order;
> -				unsigned long outer_pfn;
> -				int page_mt = get_pageblock_migratetype(page);
> -				bool isolate_page = !is_migrate_isolate_page(page);
> +			if (PageHuge(page) || __PageMovable(page)) {
>  				struct compact_control cc = {
>  					.nr_migratepages = 0,
>  					.order = -1,
> @@ -425,52 +419,12 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				};
>  				INIT_LIST_HEAD(&cc.migratepages);
>
> -				/*
> -				 * XXX: mark the page as MIGRATE_ISOLATE so that
> -				 * no one else can grab the freed page after migration.
> -				 * Ideally, the page should be freed as two separate
> -				 * pages to be added into separate migratetype free
> -				 * lists.
> -				 */
> -				if (isolate_page) {
> -					ret = set_migratetype_isolate(page, page_mt,
> -						flags, head_pfn, head_pfn + nr_pages);
> -					if (ret)
> -						goto failed;
> -				}
> -
>  				ret = __alloc_contig_migrate_range(&cc, head_pfn,
>  							head_pfn + nr_pages);
> -
> -				/*
> -				 * restore the page's migratetype so that it can
> -				 * be split into separate migratetype free lists
> -				 * later.
> -				 */
> -				if (isolate_page)
> -					unset_migratetype_isolate(page, page_mt);
> -
>  				if (ret)
>  					goto failed;
> -				/*
> -				 * reset pfn to the head of the free page, so
> -				 * that the free page handling code above can split
> -				 * the free page to the right migratetype list.
> -				 *
> -				 * head_pfn is not used here as a hugetlb page order
> -				 * can be bigger than MAX_ORDER, but after it is
> -				 * freed, the free page order is not. Use pfn within
> -				 * the range to find the head of the free page.
> -				 */
> -				order = 0;
> -				outer_pfn = pfn;
> -				while (!PageBuddy(pfn_to_page(outer_pfn))) {
> -					/* stop if we cannot find the free page */
> -					if (++order > MAX_ORDER)
> -						goto failed;
> -					outer_pfn &= ~0UL << order;
> -				}
> -				pfn = outer_pfn;
> +
> +				pfn = head_pfn + nr_pages;
>  				continue;
>  			} else
>  #endif
> -- 
> 2.42.0

It looks good to me. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 19:49                                                           ` Zi Yan
@ 2023-10-16 20:26                                                             ` Johannes Weiner
  2023-10-16 20:39                                                               ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-10-16 20:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Mon, Oct 16, 2023 at 03:49:49PM -0400, Zi Yan wrote:
> On 16 Oct 2023, at 14:51, Johannes Weiner wrote:
> 
> > On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
> >> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
> >>
> >>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> >>>>> The attached patch has all the suggested changes, let me know how it
> >>>>> looks to you. Thanks.
> >>>>
> >>>> The one I sent has free page accounting issues. The attached one fixes them.
> >>>
> >>> Do you still have the warnings? I wonder what went wrong.
> >>
> >> No warnings. But something with the code:
> >>
> >> 1. in your version, split_free_page() is called without changing any pageblock
> >> migratetypes, then split_free_page() is just a no-op, since the page is
> >> just deleted from the free list, then freed via different orders. Buddy allocator
> >> will merge them back.
> >
> > Hm not quite.
> >
> > If it's the tail block of a buddy, I update its type before
> > splitting. The splitting loop looks up the type of each block for
> > sorting it onto freelists.
> >
> > If it's the head block, yes I split it first according to its old
> > type. But then I let it fall through to scanning the block, which will
> > find that buddy, update its type and move it.
> 
> That is the issue: split_free_page() assumes the pageblocks of that
> free page have different types. It basically just frees the page as
> smaller-order chunks that sum up to the original free page order. If
> all pageblocks of the free page have the same migratetype,
> __free_one_page() will merge these small-order pages back into the
> original-order free page.

duh, of course, you're right. Thanks for patiently explaining this.

> >> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
> >> but it causes free page accounting issues, since in the case of head, free pages
> >> are deleted from new_mt when they are in old_mt free list and the accounting
> >> decreases new_mt free page number instead of old_mt one.
> >
> > Right, that makes sense.
> >
> >> Basically, split_free_page() is awkward as it relies on preset migratetypes,
> >> which changes migratetypes without deleting the free pages from the list first.
> >> That is why I came up with the new split_free_page() below.
> >
> > Yeah, the in-between thing is bad. Either it fixes the migratetype
> > before deletion, or it doesn't do the deletion. I'm thinking it would
> > be simpler to move the deletion out instead.
> 
> Yes and no. After deletion, a free page no longer has PageBuddy set
> and its buddy_order information is cleared. Either we restore PageBuddy
> and the order on the deleted free page, or split_free_page() needs to
> be changed to accept pages without that information (basically, remove
> the PageBuddy and order checks).

Good point, that requires extra care.

It's correct in the code now, but it deserves a comment, especially
because of the "buddy" naming in the new split function.

> >> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
> >> pageblock and in turn makes an in-use page have more than one pageblock,
> >> we will have problems. Since in isolate_single_pageblock(), an in-use page
> >> can have part of its pageblock set to a different migratetype and be freed,
> >> causing the free page with unmatched migratetypes. We might need to
> >> free pages at pageblock_order if their orders are bigger than pageblock_order.
> >
> > Is this a practical issue? You mentioned that right now only gigantic
> > pages can be larger than a pageblock, and those are freed in order-0
> > chunks.
> 
> Only if the system allocates a non-hugetlb page with order
> >pageblock_order and frees it at the same order. I just do not know
> whether such pages exist on architectures other than x86. Maybe I am
> overthinking it.

Hm, I removed LRU pages from the handling (and added the warning) but
I left in PageMovable(). The only users are z3fold, zsmalloc and
memory ballooning. AFAICS none of them can be bigger than a pageblock.
Let me remove that and add a warning for that case as well.

This way, we only attempt to migrate hugetlb, where we know the free
path - and get warnings for anything else that's larger than expected.

This seems like the safest option. On the off chance that there is a
regression, it won't jeopardize anybody's systems, while the warning
provides all the information we need to debug what's going on.

> > From a0460ad30a24cf73816ac40b262af0ba3723a242 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 16 Oct 2023 12:32:21 -0400
> > Subject: [PATCH] mm: page_isolation: prepare for hygienic freelists
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

> It looks good to me. Thanks.
> 
> Reviewed-by: Zi Yan <ziy@nvidia.com>

Thank you for all your help!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 20:26                                                             ` Johannes Weiner
@ 2023-10-16 20:39                                                               ` Johannes Weiner
  2023-10-16 20:48                                                                 ` Zi Yan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2023-10-16 20:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On Mon, Oct 16, 2023 at 04:26:30PM -0400, Johannes Weiner wrote:
> On Mon, Oct 16, 2023 at 03:49:49PM -0400, Zi Yan wrote:
> > On 16 Oct 2023, at 14:51, Johannes Weiner wrote:
> > 
> > > On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
> > >> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
> > >>
> > >>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> > >>>>> The attached patch has all the suggested changes, let me know how it
> > >>>>> looks to you. Thanks.
> > >>>>
> > >>>> The one I sent has free page accounting issues. The attached one fixes them.
> > >>>
> > >>> Do you still have the warnings? I wonder what went wrong.
> > >>
> > >> No warnings. But something with the code:
> > >>
> > >> 1. in your version, split_free_page() is called without changing any pageblock
> > >> migratetypes, then split_free_page() is just a no-op, since the page is
> > >> just deleted from the free list, then freed via different orders. Buddy allocator
> > >> will merge them back.
> > >
> > > Hm not quite.
> > >
> > > If it's the tail block of a buddy, I update its type before
> > > splitting. The splitting loop looks up the type of each block for
> > > sorting it onto freelists.
> > >
> > > If it's the head block, yes I split it first according to its old
> > > type. But then I let it fall through to scanning the block, which will
> > > find that buddy, update its type and move it.
> > 
> > That is the issue: split_free_page() assumes the pageblocks of that
> > free page have different types. It basically just frees the page as
> > smaller-order chunks that sum up to the original free page order. If
> > all pageblocks of the free page have the same migratetype,
> > __free_one_page() will merge these small-order pages back into the
> > original-order free page.
> 
> duh, of course, you're right. Thanks for patiently explaining this.
> 
> > >> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
> > >> but it causes free page accounting issues, since in the case of head, free pages
> > >> are deleted from new_mt when they are in old_mt free list and the accounting
> > >> decreases new_mt free page number instead of old_mt one.
> > >
> > > Right, that makes sense.
> > >
> > >> Basically, split_free_page() is awkward as it relies on preset migratetypes,
> > >> which changes migratetypes without deleting the free pages from the list first.
> > >> That is why I came up with the new split_free_page() below.
> > >
> > > Yeah, the in-between thing is bad. Either it fixes the migratetype
> > > before deletion, or it doesn't do the deletion. I'm thinking it would
> > > be simpler to move the deletion out instead.
> > 
> > Yes and no. After deletion, a free page no longer has PageBuddy set
> > and its buddy_order information is cleared. Either we restore PageBuddy
> > and the order on the deleted free page, or split_free_page() needs to
> > be changed to accept pages without that information (basically, remove
> > the PageBuddy and order checks).
> 
> Good point, that requires extra care.
> 
> It's correct in the code now, but it deserves a comment, especially
> because of the "buddy" naming in the new split function.
> 
> > >> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
> > >> pageblock and in turn makes an in-use page have more than one pageblock,
> > >> we will have problems. Since in isolate_single_pageblock(), an in-use page
> > >> can have part of its pageblock set to a different migratetype and be freed,
> > >> causing the free page with unmatched migratetypes. We might need to
> > >> free pages at pageblock_order if their orders are bigger than pageblock_order.
> > >
> > > Is this a practical issue? You mentioned that right now only gigantic
> > > pages can be larger than a pageblock, and those are freed in order-0
> > > chunks.
> > 
> > Only if the system allocates a non-hugetlb page with order
> > >pageblock_order and frees it at the same order. I just do not know
> > whether such pages exist on architectures other than x86. Maybe I am
> > overthinking it.
> 
> Hm, I removed LRU pages from the handling (and added the warning) but
> I left in PageMovable(). The only users are z3fold, zsmalloc and
> memory ballooning. AFAICS none of them can be bigger than a pageblock.
> Let me remove that and add a warning for that case as well.
> 
> This way, we only attempt to migrate hugetlb, where we know the free
> path - and get warnings for anything else that's larger than expected.
> 
> This seems like the safest option. On the off chance that there is a
> regression, it won't jeopardize anybody's systems, while the warning
> provides all the information we need to debug what's going on.

This delta on top?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5292ad9860c..0da7c61af37e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1628,7 +1628,7 @@ static int move_freepages_block(struct zone *zone, struct page *page,
 }
 
 #ifdef CONFIG_MEMORY_ISOLATION
-/* Look for a multi-block buddy that straddles start_pfn */
+/* Look for a buddy that straddles start_pfn */
 static unsigned long find_large_buddy(unsigned long start_pfn)
 {
 	int order = 0;
@@ -1652,7 +1652,7 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
 	return start_pfn;
 }
 
-/* Split a multi-block buddy into its individual pageblocks */
+/* Split a multi-block free page into its individual pageblocks */
 static void split_large_buddy(struct zone *zone, struct page *page,
 			      unsigned long pfn, int order)
 {
@@ -1661,6 +1661,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	VM_WARN_ON_ONCE(order < pageblock_order);
 	VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
 
+	/* Caller removed page from freelist, buddy info cleared! */
+	VM_WARN_ON_ONCE(PageBuddy(page));
+
 	while (pfn != end_pfn) {
 		int mt = get_pfnblock_migratetype(page, pfn);
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b4d53545496d..c8b3c0699683 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -399,14 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				continue;
 			}
 
-			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
-
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-			/*
-			 * hugetlb, and movable compound pages can be
-			 * migrated. Otherwise, fail the isolation.
-			 */
-			if (PageHuge(page) || __PageMovable(page)) {
+			if (PageHuge(page)) {
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -426,9 +420,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				pfn = head_pfn + nr_pages;
 				continue;
-			} else
+			}
+
+			/*
+			 * These pages are movable too, but they're
+			 * not expected to exceed pageblock_order.
+			 *
+			 * Let us know when they do, so we can add
+			 * proper free and split handling for them.
+			 */
+			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
+			VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page);
 #endif
-				goto failed;
+			goto failed;
 		}
 
 		pfn++;


* Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
  2023-10-16 20:39                                                               ` Johannes Weiner
@ 2023-10-16 20:48                                                                 ` Zi Yan
  0 siblings, 0 replies; 83+ messages in thread
From: Zi Yan @ 2023-10-16 20:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
	Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm, linux-kernel

On 16 Oct 2023, at 16:39, Johannes Weiner wrote:

> On Mon, Oct 16, 2023 at 04:26:30PM -0400, Johannes Weiner wrote:
>> On Mon, Oct 16, 2023 at 03:49:49PM -0400, Zi Yan wrote:
>>> On 16 Oct 2023, at 14:51, Johannes Weiner wrote:
>>>
>>>> On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
>>>>> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
>>>>>
>>>>>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
>>>>>>>> The attached patch has all the suggested changes, let me know how it
>>>>>>>> looks to you. Thanks.
>>>>>>>
>>>>>>> The one I sent has free page accounting issues. The attached one fixes them.
>>>>>>
>>>>>> Do you still have the warnings? I wonder what went wrong.
>>>>>
>>>>> No warnings. But something is off with the code:
>>>>>
>>>>> 1. in your version, split_free_page() is called without changing any pageblock
>>>>> migratetypes, so split_free_page() is just a no-op: the page is simply
>>>>> deleted from the free list, then freed at different orders. The buddy
>>>>> allocator will merge them back.
>>>>
>>>> Hm not quite.
>>>>
>>>> If it's the tail block of a buddy, I update its type before
>>>> splitting. The splitting loop looks up the type of each block for
>>>> sorting it onto freelists.
>>>>
>>>> If it's the head block, yes I split it first according to its old
>>>> type. But then I let it fall through to scanning the block, which will
>>>> find that buddy, update its type and move it.
>>>
>>> That is the issue, since split_free_page() assumes the pageblocks of
>>> that free page have different types. It basically just frees the page
>>> as smaller-order pieces that sum up to the original free page order.
>>> If all pageblocks of the free page have the same migratetype, __free_one_page()
>>> will merge these small-order pages back into the original-order free page.
>>
>> duh, of course, you're right. Thanks for patiently explaining this.
>>
>>>>> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
>>>>> but it causes free page accounting issues, since in the case of head, free pages
>>>>> are deleted from new_mt while they are still on the old_mt free list, and
>>>>> the accounting decreases the new_mt free page count instead of the old_mt one.
>>>>
>>>> Right, that makes sense.
>>>>
>>>>> Basically, split_free_page() is awkward as it relies on preset migratetypes:
>>>>> the caller changes the migratetypes without first deleting the free pages
>>>>> from the list. That is why I came up with the new split_free_page() below.
>>>>
>>>> Yeah, the in-between thing is bad. Either it fixes the migratetype
>>>> before deletion, or it doesn't do the deletion. I'm thinking it would
>>>> be simpler to move the deletion out instead.
>>>
>>> Yes and no. After deletion, a free page no longer has PageBuddy set and
>>> has its buddy_order information cleared. Either we restore PageBuddy and
>>> the order on the deleted free page, or split_free_page() needs to be
>>> changed to accept pages without that information (basically remove the
>>> PageBuddy and order checks).
>>
>> Good point, that requires extra care.
>>
>> It's correct in the code now, but it deserves a comment, especially
>> because of the "buddy" naming in the new split function.
>>
>>>>> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy span more than one
>>>>> pageblock, and in turn makes an in-use page span more than one pageblock,
>>>>> we will have problems. In isolate_single_pageblock(), an in-use page can
>>>>> have part of its pageblocks set to a different migratetype and then be
>>>>> freed, leaving a free page with mismatched migratetypes. We might need to
>>>>> free pages at pageblock_order if their orders are bigger than pageblock_order.
>>>>
>>>> Is this a practical issue? You mentioned that right now only gigantic
>>>> pages can be larger than a pageblock, and those are freed in order-0
>>>> chunks.
>>>
>>> Only if the system allocates a (non-hugetlb) page with order > pageblock_order
>>> and frees it at the same order. I just do not know if such pages exist on
>>> arches other than x86. Maybe I am just overthinking it.
>>
>> Hm, I removed LRU pages from the handling (and added the warning) but
>> I left in PageMovable(). The only users are z3fold, zsmalloc and
>> memory ballooning. AFAICS none of them can be bigger than a pageblock.
>> Let me remove that and add a warning for that case as well.
>>
>> This way, we only attempt to migrate hugetlb, where we know the free
>> path - and get warnings for anything else that's larger than expected.
>>
>> This seems like the safest option. On the off chance that there is a
>> regression, it won't jeopardize anybody's systems, while the warning
>> provides all the information we need to debug what's going on.
>
> This delta on top?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5292ad9860c..0da7c61af37e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1628,7 +1628,7 @@ static int move_freepages_block(struct zone *zone, struct page *page,
>  }
>
>  #ifdef CONFIG_MEMORY_ISOLATION
> -/* Look for a multi-block buddy that straddles start_pfn */
> +/* Look for a buddy that straddles start_pfn */
>  static unsigned long find_large_buddy(unsigned long start_pfn)
>  {
>  	int order = 0;
> @@ -1652,7 +1652,7 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
>  	return start_pfn;
>  }
>
> -/* Split a multi-block buddy into its individual pageblocks */
> +/* Split a multi-block free page into its individual pageblocks */
>  static void split_large_buddy(struct zone *zone, struct page *page,
>  			      unsigned long pfn, int order)
>  {
> @@ -1661,6 +1661,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
>  	VM_WARN_ON_ONCE(order < pageblock_order);
>  	VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
>
> +	/* Caller removed page from freelist, buddy info cleared! */
> +	VM_WARN_ON_ONCE(PageBuddy(page));
> +
>  	while (pfn != end_pfn) {
>  		int mt = get_pfnblock_migratetype(page, pfn);
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b4d53545496d..c8b3c0699683 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -399,14 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				continue;
>  			}
>
> -			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
> -
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> -			/*
> -			 * hugetlb, and movable compound pages can be
> -			 * migrated. Otherwise, fail the isolation.
> -			 */
> -			if (PageHuge(page) || __PageMovable(page)) {
> +			if (PageHuge(page)) {
>  				struct compact_control cc = {
>  					.nr_migratepages = 0,
>  					.order = -1,
> @@ -426,9 +420,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>
>  				pfn = head_pfn + nr_pages;
>  				continue;
> -			} else
> +			}
> +
> +			/*
> +			 * These pages are movable too, but they're
> +			 * not expected to exceed pageblock_order.
> +			 *
> +			 * Let us know when they do, so we can add
> +			 * proper free and split handling for them.
> +			 */
> +			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
> +			VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page);
>  #endif
> -				goto failed;
> +			goto failed;
>  		}
>
>  		pfn++;

LGTM.

I was thinking about adding

VM_WARN_ON_ONCE_PAGE(order > pageblock_order, page);

in __free_pages() to catch all possible cases, but that is a really hot path.

And just for the record, we probably can easily fix the above warnings,
if they ever show up, by freeing >pageblock_order pages in units of
pageblock_order.

--
Best Regards,
Yan, Zi



end of thread, other threads:[~2023-10-16 20:48 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-11 19:41 [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Johannes Weiner
2023-09-11 19:41 ` [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching Johannes Weiner
2023-09-11 19:59   ` Zi Yan
2023-09-11 21:09     ` Andrew Morton
2023-09-12 13:47   ` Vlastimil Babka
2023-09-12 14:50     ` Johannes Weiner
2023-09-13  9:33       ` Vlastimil Babka
2023-09-13 13:24         ` Johannes Weiner
2023-09-13 13:34           ` Vlastimil Babka
2023-09-12 15:03     ` Johannes Weiner
2023-09-14  7:29       ` Vlastimil Babka
2023-09-14  9:56   ` Mel Gorman
2023-09-27  5:42   ` Huang, Ying
2023-09-27 14:51     ` Johannes Weiner
2023-09-30  4:26       ` Huang, Ying
2023-10-02 14:58         ` Johannes Weiner
2023-09-11 19:41 ` [PATCH 2/6] mm: page_alloc: fix up block types when merging compatible blocks Johannes Weiner
2023-09-11 20:01   ` Zi Yan
2023-09-13  9:52   ` Vlastimil Babka
2023-09-14 10:00   ` Mel Gorman
2023-09-11 19:41 ` [PATCH 3/6] mm: page_alloc: move free pages when converting block during isolation Johannes Weiner
2023-09-11 20:17   ` Zi Yan
2023-09-11 20:47     ` Johannes Weiner
2023-09-11 20:50       ` Zi Yan
2023-09-13 14:31   ` Vlastimil Babka
2023-09-14 10:03   ` Mel Gorman
2023-09-11 19:41 ` [PATCH 4/6] mm: page_alloc: fix move_freepages_block() range error Johannes Weiner
2023-09-11 20:23   ` Zi Yan
2023-09-13 14:40   ` Vlastimil Babka
2023-09-14 13:37     ` Johannes Weiner
2023-09-14 10:03   ` Mel Gorman
2023-09-11 19:41 ` [PATCH 5/6] mm: page_alloc: fix freelist movement during block conversion Johannes Weiner
2023-09-13 19:52   ` Vlastimil Babka
2023-09-14 14:47     ` Johannes Weiner
2023-09-11 19:41 ` [PATCH 6/6] mm: page_alloc: consolidate free page accounting Johannes Weiner
2023-09-13 20:18   ` Vlastimil Babka
2023-09-14  4:11     ` Johannes Weiner
2023-09-14 23:52 ` [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene Mike Kravetz
2023-09-15 14:16   ` Johannes Weiner
2023-09-15 15:05     ` Mike Kravetz
2023-09-16 19:57     ` Mike Kravetz
2023-09-16 20:13       ` Andrew Morton
2023-09-18  7:16       ` Vlastimil Babka
2023-09-18 14:52         ` Johannes Weiner
2023-09-18 17:40           ` Mike Kravetz
2023-09-19  6:49             ` Johannes Weiner
2023-09-19 12:37               ` Zi Yan
2023-09-19 15:22                 ` Zi Yan
2023-09-19 18:47               ` Mike Kravetz
2023-09-19 20:57                 ` Zi Yan
2023-09-20  0:32                   ` Mike Kravetz
2023-09-20  1:38                     ` Zi Yan
2023-09-20  6:07                       ` Vlastimil Babka
2023-09-20 13:48                         ` Johannes Weiner
2023-09-20 16:04                           ` Johannes Weiner
2023-09-20 17:23                             ` Zi Yan
2023-09-21  2:31                               ` Zi Yan
2023-09-21 10:19                                 ` David Hildenbrand
2023-09-21 14:47                                   ` Zi Yan
2023-09-25 21:12                                     ` Zi Yan
2023-09-26 17:39                                       ` Johannes Weiner
2023-09-28  2:51                                         ` Zi Yan
2023-10-03  2:26                                           ` Zi Yan
2023-10-10 21:12                                             ` Johannes Weiner
2023-10-11 15:25                                               ` Johannes Weiner
2023-10-11 15:45                                                 ` Johannes Weiner
2023-10-11 15:57                                                   ` Zi Yan
2023-10-13  0:06                                               ` Zi Yan
2023-10-13 14:51                                                 ` Zi Yan
2023-10-16 13:35                                                   ` Zi Yan
2023-10-16 14:37                                                     ` Johannes Weiner
2023-10-16 15:00                                                       ` Zi Yan
2023-10-16 18:51                                                         ` Johannes Weiner
2023-10-16 19:49                                                           ` Zi Yan
2023-10-16 20:26                                                             ` Johannes Weiner
2023-10-16 20:39                                                               ` Johannes Weiner
2023-10-16 20:48                                                                 ` Zi Yan
2023-09-26 18:19                                     ` David Hildenbrand
2023-09-28  3:22                                       ` Zi Yan
2023-10-02 11:43                                         ` David Hildenbrand
2023-10-03  2:35                                           ` Zi Yan
2023-09-18  7:07     ` Vlastimil Babka
2023-09-18 14:09       ` Johannes Weiner
