* [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
@ 2024-02-29 18:34 Yu Zhao
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Yu Zhao @ 2024-02-29 18:34 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Jonathan Corbet, Yu Zhao

TAO is an umbrella project aiming at a better economy of physical
contiguity viewed as a valuable resource. A few examples are:
1. A multi-tenant system can have guaranteed THP coverage while
   hosting abusers/misusers of the resource.
2. Abusers/misusers, e.g., workloads excessively requesting and then
   splitting THPs, should be punished if necessary.
3. Good citizens should be rewarded with, e.g., lower allocation
   latency and lower metadata (struct page) overhead.
4. Better interoperability with userspace memory allocators when
   transacting the resource.

This project puts the same emphasis on the established use case for
servers and the emerging use case for clients so that client workloads
like Android and ChromeOS can leverage the recent multi-sized THPs
[1][2].

Chapter One introduces the cornerstone of TAO: an abstraction called
policy (virtual) zones, which are overlaid on the physical zones.
This is in line with item 1 above.

A new door opens after Chapter One. The following two chapters
discuss the reverse of THP collapsing, called THP shattering, and THP
HVO, which brings the hugeTLB feature [3] to THP. They are in line
with items 2 & 3 above.

Advanced use cases are discussed in Epilogue, since they require the
cooperation of userspace memory allocators. This is in line with item
4 above.

[1] https://lwn.net/Articles/932386/
[2] https://lwn.net/Articles/937239/
[3] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html

Yu Zhao (4):
  THP zones: the use cases of policy zones
  THP shattering: the reverse of collapsing
  THP HVO: bring the hugeTLB feature to THP
  Profile-Guided Heap Optimization and THP fungibility

 .../admin-guide/kernel-parameters.txt         |  10 +
 drivers/virtio/virtio_mem.c                   |   2 +-
 include/linux/gfp.h                           |  24 +-
 include/linux/huge_mm.h                       |   6 -
 include/linux/memcontrol.h                    |   5 +
 include/linux/mempolicy.h                     |   2 +-
 include/linux/mm.h                            | 140 ++++++
 include/linux/mm_inline.h                     |  24 +
 include/linux/mm_types.h                      |   8 +-
 include/linux/mmzone.h                        |  53 +-
 include/linux/nodemask.h                      |   2 +-
 include/linux/rmap.h                          |   4 +
 include/linux/vm_event_item.h                 |   5 +-
 include/trace/events/mmflags.h                |   4 +-
 init/main.c                                   |   1 +
 mm/compaction.c                               |  12 +
 mm/gup.c                                      |   3 +-
 mm/huge_memory.c                              | 304 ++++++++++--
 mm/hugetlb_vmemmap.c                          |   2 +-
 mm/internal.h                                 |  47 +-
 mm/madvise.c                                  |  11 +-
 mm/memcontrol.c                               |  47 ++
 mm/memory-failure.c                           |   2 +-
 mm/memory.c                                   |  11 +-
 mm/mempolicy.c                                |  14 +-
 mm/migrate.c                                  |  51 +-
 mm/mm_init.c                                  | 452 ++++++++++--------
 mm/page_alloc.c                               | 199 +++++++-
 mm/page_isolation.c                           |   2 +-
 mm/rmap.c                                     |  21 +-
 mm/shmem.c                                    |   4 +-
 mm/swap_slots.c                               |   3 +-
 mm/truncate.c                                 |   6 +-
 mm/userfaultfd.c                              |   2 +-
 mm/vmscan.c                                   |  41 +-
 mm/vmstat.c                                   |  12 +-
 36 files changed, 1194 insertions(+), 342 deletions(-)


base-commit: d206a76d7d2726f3b096037f2079ce0bd3ba329b
-- 
2.44.0.rc1.240.g4c46232300-goog




* [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
@ 2024-02-29 18:34 ` Yu Zhao
  2024-02-29 20:28   ` Matthew Wilcox
                     ` (4 more replies)
  2024-02-29 18:34 ` [Chapter Two] THP shattering: the reverse of collapsing Yu Zhao
                   ` (5 subsequent siblings)
  6 siblings, 5 replies; 28+ messages in thread
From: Yu Zhao @ 2024-02-29 18:34 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Jonathan Corbet, Yu Zhao

There are three types of zones:
1. The first four zones partition the physical address space of CPU
   memory.
2. The device zone provides interoperability between CPU and device
   memory.
3. The movable zone commonly represents a memory allocation policy.

Though originally designed for memory hot removal, the movable zone is
widely used for other purposes, e.g., CMA and the kdump kernel, on
platforms that do not support hot removal, e.g., Android and ChromeOS.
Nowadays, it is legitimately a zone independent of any physical
characteristics. In spite of being somewhat regarded as a hack, largely
due to the lack of a generic design concept for its true major use
cases (on billions of client devices), the movable zone naturally
resembles a policy (virtual) zone overlaid on the first four
(physical) zones.

This proposal formally generalizes this concept as policy zones so
that additional policies can be implemented and enforced by subsequent
zones after the movable zone. An inherited requirement of policy zones
(and the first four zones) is that subsequent zones must be able to
fall back to previous zones and therefore must add new properties to
the previous zones rather than remove existing ones from them. Also,
all properties must be known at allocation time rather than at
runtime; e.g., memory object size and mobility are valid properties,
but hotness and lifetime are not.
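
As a rough illustration of the fallback rule, the sketch below shows
how a zonelist walk can simply skip any policy zone whose properties a
request does not satisfy; it is a simplification of the
get_page_from_freelist() change in mm/page_alloc.c further down, not
part of the patch itself:

	/*
	 * Sketch only: zone_is_suitable() comes from this patch; the
	 * walk goes from the highest eligible zone down to the lowest,
	 * so policy zones always fall back to the zones before them.
	 */
	static struct zone *first_suitable_zone(struct zonelist *zonelist,
						enum zone_type highest_zoneidx,
						int order)
	{
		struct zoneref *z;
		struct zone *zone;

		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
			if (!zone_is_suitable(zone, order))
				continue;
			return zone;
		}

		return NULL;
	}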

ZONE_MOVABLE becomes the first policy zone, followed by two new policy
zones:
1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
   ZONE_MOVABLE) and restricted to a minimum order as an
   anti-fragmentation measure. The latter means that they cannot be
   split below that order, whether they are free or in use.
2. ZONE_NOMERGE, which contains pages that are movable and restricted
   to an exact order. The latter means that not only splitting
   (inherited from ZONE_NOSPLIT) but also merging is prohibited (see
   the reason in Chapter Three), whether they are free or in use.

Since these two zones can only serve THP allocations (__GFP_MOVABLE |
__GFP_COMP), they are called THP zones. Reclaim works seamlessly for
these two zones, and compaction is not needed.
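
To make the mapping concrete, here is a sketch (not part of the patch)
of how gfp_order_zone() from the include/linux/gfp.h hunk below routes
requests, assuming zone_nosplit_order == 4 and zone_nomerge_order == 9;
GFP_TRANSHUGE includes both __GFP_MOVABLE and __GFP_COMP:

	/* Sketch only: illustrative orders, not a definitive test. */
	static void thp_zone_examples(void)
	{
		/* Exactly the nomerge order: eligible for ZONE_NOMERGE. */
		BUG_ON(gfp_order_zone(GFP_TRANSHUGE, 9) != ZONE_NOMERGE);

		/* At least the nosplit order: eligible for ZONE_NOSPLIT. */
		BUG_ON(gfp_order_zone(GFP_TRANSHUGE, 4) != ZONE_NOSPLIT);

		/* Below the nosplit order: falls back to ZONE_MOVABLE. */
		BUG_ON(gfp_order_zone(GFP_TRANSHUGE, 2) != ZONE_MOVABLE);

		/* No __GFP_COMP: never steered into the THP zones. */
		BUG_ON(gfp_order_zone(GFP_HIGHUSER_MOVABLE, 0) != ZONE_MOVABLE);
	}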

Compared with the hugeTLB pool approach, THP zones tap into core MM
features including:
1. THP allocations can fall back to the lower zones, which can have
   higher latency but still succeed.
2. THPs can be either shattered (see Chapter Two) if partially
   unmapped or reclaimed if they become cold.
3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
   contiguous PTEs on arm64 [1], which are more suitable for client
   workloads.

Policy zones can be dynamically resized by offlining pages in one of
them and onlining those pages in another of them. Note that this is
only done among policy zones, not between a policy zone and a physical
zone, since resizing is a (software) policy, not a physical
characteristic.
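
For example, the initial carve-out could be done at boot with the
nosplit=/nomerge= parameters added in kernel-parameters.txt below
(sizes and orders here are purely illustrative):

	nosplit=3G,4 nomerge=1G,9

This would create a 3GB ZONE_NOSPLIT whose pages never go below order
4 (64KB with 4KB base pages) and a 1GB ZONE_NOMERGE holding only
order-9 (2MB) pages. Per the check in find_virt_zones(), the nomerge
order must be higher than the nosplit order.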

Implementing the same idea at the pageblock granularity has also been
explored but rejected at Google. Pageblocks have a finer granularity
and therefore can be more flexible than zones. The tradeoff is that
this alternative implementation was more complex and failed to bring a
better ROI. However, the rejection was mainly due to its inability to
be smoothly extended to 1GB THPs [2], which is a planned use case of
TAO.

[1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 drivers/virtio/virtio_mem.c                   |   2 +-
 include/linux/gfp.h                           |  24 +-
 include/linux/huge_mm.h                       |   6 -
 include/linux/mempolicy.h                     |   2 +-
 include/linux/mmzone.h                        |  52 +-
 include/linux/nodemask.h                      |   2 +-
 include/linux/vm_event_item.h                 |   2 +-
 include/trace/events/mmflags.h                |   4 +-
 mm/compaction.c                               |  12 +
 mm/huge_memory.c                              |   5 +-
 mm/mempolicy.c                                |  14 +-
 mm/migrate.c                                  |   7 +-
 mm/mm_init.c                                  | 452 ++++++++++--------
 mm/page_alloc.c                               |  44 +-
 mm/page_isolation.c                           |   2 +-
 mm/swap_slots.c                               |   3 +-
 mm/vmscan.c                                   |  32 +-
 mm/vmstat.c                                   |   7 +-
 19 files changed, 431 insertions(+), 251 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31b3a25680d0..a6c181f6efde 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3529,6 +3529,16 @@
 			allocations which rules out almost all kernel
 			allocations. Use with caution!
 
+	nosplit=X,Y	[MM] Set the minimum order of the nosplit zone. Pages in
+			this zone can't be split down below order Y, while free
+			or in use.
+			Like movablecore, X should be either nn[KMGTPE] or n%.
+
+	nomerge=X,Y	[MM] Set the exact order of the nomerge zone. Pages in
+			this zone are always order Y, meaning they can't be
+			split or merged while free or in use.
+			Like movablecore, X should be either nn[KMGTPE] or n%.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 8e3223294442..37ecf5ee4afd 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm,
 		page = pfn_to_online_page(pfn);
 		if (!page)
 			continue;
-		if (page_zonenum(page) != ZONE_MOVABLE)
+		if (!is_zone_movable_page(page))
 			return false;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..c0f9d21b4d18 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
  * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
  */
 
-#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
-/* ZONE_DEVICE is not a valid GFP zone specifier */
+#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4
+/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */
 #define GFP_ZONES_SHIFT 2
 #else
 #define GFP_ZONES_SHIFT ZONES_SHIFT
@@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags)
 	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
 					 ((1 << GFP_ZONES_SHIFT) - 1);
 	VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
+
+	if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP))
+		return LAST_VIRT_ZONE;
+
 	return z;
 }
 
+extern int zone_nomerge_order __read_mostly;
+extern int zone_nosplit_order __read_mostly;
+
+static inline enum zone_type gfp_order_zone(gfp_t flags, int order)
+{
+	enum zone_type zid = gfp_zone(flags);
+
+	if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
+		zid = ZONE_NOMERGE - 1;
+
+	if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
+		zid = ZONE_NOSPLIT - 1;
+
+	return zid;
+}
+
 /*
  * There is only one page-allocator function, and two main namespaces to
  * it. The alloc_page*() variants return 'struct page *' and as such
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..9960ad7c3b10 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
 void folio_prep_large_rmappable(struct folio *folio);
-bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {}
 
 #define thp_get_unmapped_area	NULL
 
-static inline bool
-can_split_folio(struct folio *folio, int *pextra_pins)
-{
-	return false;
-}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..a92bcf47cf8c 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -150,7 +150,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone && k != ZONE_MOVABLE)
+	if (k > policy_zone && !zid_is_virt(k))
 		policy_zone = k;
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f189d988..532218167bba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -805,11 +805,15 @@ enum zone_type {
 	 * there can be false negatives).
 	 */
 	ZONE_MOVABLE,
+	ZONE_NOSPLIT,
+	ZONE_NOMERGE,
 #ifdef CONFIG_ZONE_DEVICE
 	ZONE_DEVICE,
 #endif
-	__MAX_NR_ZONES
+	__MAX_NR_ZONES,
 
+	LAST_PHYS_ZONE = ZONE_MOVABLE - 1,
+	LAST_VIRT_ZONE = ZONE_NOMERGE,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -929,6 +933,8 @@ struct zone {
 	seqlock_t		span_seqlock;
 #endif
 
+	int order;
+
 	int initialized;
 
 	/* Write-intensive fields used from the page allocator */
@@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio)
 
 static inline bool is_zone_movable_page(const struct page *page)
 {
-	return page_zonenum(page) == ZONE_MOVABLE;
+	return page_zonenum(page) >= ZONE_MOVABLE;
 }
 
 static inline bool folio_is_zone_movable(const struct folio *folio)
 {
-	return folio_zonenum(folio) == ZONE_MOVABLE;
+	return folio_zonenum(folio) >= ZONE_MOVABLE;
+}
+
+static inline bool page_can_split(struct page *page)
+{
+	return page_zonenum(page) < ZONE_NOSPLIT;
+}
+
+static inline bool folio_can_split(struct folio *folio)
+{
+	return folio_zonenum(folio) < ZONE_NOSPLIT;
 }
 #endif
 
@@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; };
  */
 #define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
 
+static inline bool zid_is_virt(enum zone_type zid)
+{
+	return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE;
+}
+
+static inline bool zone_can_frag(struct zone *zone)
+{
+	VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT);
+
+	return zone_idx(zone) < ZONE_NOSPLIT;
+}
+
+static inline bool zone_is_suitable(struct zone *zone, int order)
+{
+	int zid = zone_idx(zone);
+
+	if (zid < ZONE_NOSPLIT)
+		return true;
+
+	if (!zone->order)
+		return false;
+
+	return (zid == ZONE_NOSPLIT && order >= zone->order) ||
+	       (zid == ZONE_NOMERGE && order == zone->order);
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 static inline bool zone_is_zone_device(struct zone *zone)
 {
@@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone)
 static inline void zone_set_nid(struct zone *zone, int nid) {}
 #endif
 
-extern int movable_zone;
+extern int virt_zone;
 
 static inline int is_highmem_idx(enum zone_type idx)
 {
 #ifdef CONFIG_HIGHMEM
 	return (idx == ZONE_HIGHMEM ||
-		(idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
+		(zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM));
 #else
 	return 0;
 #endif
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index b61438313a73..34fbe910576d 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -404,7 +404,7 @@ enum node_states {
 #else
 	N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
-	N_MEMORY,		/* The node has memory(regular, high, movable) */
+	N_MEMORY,		/* The node has memory in any of the zones */
 	N_CPU,		/* The node has one or more cpus */
 	N_GENERIC_INITIATOR,	/* The node has one or more Generic Initiators */
 	NR_NODE_STATES
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..9a54d15d5ec3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -27,7 +27,7 @@
 #endif
 
 #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
-	HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)
+	HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx)
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index d801409b33cf..2b5fdafaadea 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	IFDEF_ZONE_DMA32(	EM (ZONE_DMA32,	 "DMA32"))	\
 				EM (ZONE_NORMAL, "Normal")	\
 	IFDEF_ZONE_HIGHMEM(	EM (ZONE_HIGHMEM,"HighMem"))	\
-				EMe(ZONE_MOVABLE,"Movable")
+				EM (ZONE_MOVABLE,"Movable")	\
+				EM (ZONE_NOSPLIT,"NoSplit")	\
+				EMe(ZONE_NOMERGE,"NoMerge")
 
 #define LRU_NAMES		\
 		EM (LRU_INACTIVE_ANON, "inactive_anon") \
diff --git a/mm/compaction.c b/mm/compaction.c
index 4add68d40e8d..8a64c805f411 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 					ac->highest_zoneidx, ac->nodemask) {
 		enum compact_result status;
 
+		if (!zone_can_frag(zone))
+			continue;
+
 		if (prio > MIN_COMPACT_PRIORITY
 					&& compaction_deferred(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
@@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
+		if (!zone_can_frag(zone))
+			continue;
+
 		cc.zone = zone;
 
 		compact_zone(&cc, NULL);
@@ -2846,6 +2852,9 @@ static void compact_node(int nid)
 		if (!populated_zone(zone))
 			continue;
 
+		if (!zone_can_frag(zone))
+			continue;
+
 		cc.zone = zone;
 
 		compact_zone(&cc, NULL);
@@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
+		if (!zone_can_frag(zone))
+			continue;
+
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
 				highest_zoneidx, ALLOC_WMARK_MIN);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..b57faa0a1e83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 }
 
 /* Racy check whether the huge page can be split */
-bool can_split_folio(struct folio *folio, int *pextra_pins)
+static bool can_split_folio(struct folio *folio, int *pextra_pins)
 {
 	int extra_pins;
 
+	if (!folio_can_split(folio))
+		return false;
+
 	/* Additional pins from page cache */
 	if (folio_test_anon(folio))
 		extra_pins = folio_test_swapcache(folio) ?
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..1f84dd759086 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma)
 
 bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 {
-	enum zone_type dynamic_policy_zone = policy_zone;
-
-	BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
+	WARN_ON_ONCE(zid_is_virt(policy_zone));
 
 	/*
-	 * if policy->nodes has movable memory only,
-	 * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
+	 * If policy->nodes has memory in virtual zones only, we apply policy
+	 * only if gfp_zone(gfp) can allocate from those zones.
 	 *
 	 * policy->nodes is intersect with node_states[N_MEMORY].
 	 * so if the following test fails, it implies
-	 * policy->nodes has movable memory only.
+	 * policy->nodes has memory in virtual zones only.
 	 */
 	if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
-		dynamic_policy_zone = ZONE_MOVABLE;
+		return zone > LAST_PHYS_ZONE;
 
-	return zone >= dynamic_policy_zone;
+	return zone >= policy_zone;
 }
 
 /* Do dynamic interleaving for a process */
diff --git a/mm/migrate.c b/mm/migrate.c
index cc9f2bcd73b4..f615c0c22046 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
 {
 	int rc;
 
+	if (!folio_can_split(folio))
+		return -EBUSY;
+
 	folio_lock(folio);
 	rc = split_folio_to_list(folio, split_folios);
 	folio_unlock(folio);
@@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
 		order = folio_order(src);
 	}
 	zidx = zone_idx(folio_zone(src));
-	if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
+	if (zidx > ZONE_NORMAL)
 		gfp_mask |= __GFP_HIGHMEM;
 
 	return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
@@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
 				break;
 		}
 		wakeup_kswapd(pgdat->node_zones + z, 0,
-			      folio_order(folio), ZONE_MOVABLE);
+			      folio_order(folio), z);
 		return 0;
 	}
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36..7769c21e6d54 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init);
 
 static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
 static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
-static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
 
-static unsigned long required_kernelcore __initdata;
-static unsigned long required_kernelcore_percent __initdata;
-static unsigned long required_movablecore __initdata;
-static unsigned long required_movablecore_percent __initdata;
+static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata;
+#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid])
+
+static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
+#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE])
+
+static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
+#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE])
+
+int zone_nosplit_order __read_mostly;
+int zone_nomerge_order __read_mostly;
 
 static unsigned long nr_kernel_pages __initdata;
 static unsigned long nr_all_pages __initdata;
@@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p)
 		return 0;
 	}
 
-	return cmdline_parse_core(p, &required_kernelcore,
-				  &required_kernelcore_percent);
+	return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE),
+				  &percentage_of(LAST_PHYS_ZONE));
 }
 early_param("kernelcore", cmdline_parse_kernelcore);
 
@@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore);
  */
 static int __init cmdline_parse_movablecore(char *p)
 {
-	return cmdline_parse_core(p, &required_movablecore,
-				  &required_movablecore_percent);
+	return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE),
+				  &percentage_of(ZONE_MOVABLE));
 }
 early_param("movablecore", cmdline_parse_movablecore);
 
+static int __init parse_zone_order(char *p, unsigned long *nr_pages,
+				   unsigned long *percent, int *order)
+{
+	int err;
+	unsigned long n;
+	char *s = strchr(p, ',');
+
+	if (!s)
+		return -EINVAL;
+
+	*s++ = '\0';
+
+	err = kstrtoul(s, 0, &n);
+	if (err)
+		return err;
+
+	if (n < 2 || n > MAX_PAGE_ORDER)
+		return -EINVAL;
+
+	err = cmdline_parse_core(p, nr_pages, percent);
+	if (err)
+		return err;
+
+	*order = n;
+
+	return 0;
+}
+
+static int __init parse_zone_nosplit(char *p)
+{
+	return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT),
+				&percentage_of(ZONE_NOSPLIT), &zone_nosplit_order);
+}
+early_param("nosplit", parse_zone_nosplit);
+
+static int __init parse_zone_nomerge(char *p)
+{
+	return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE),
+				&percentage_of(ZONE_NOMERGE), &zone_nomerge_order);
+}
+early_param("nomerge", parse_zone_nomerge);
+
 /*
  * early_calculate_totalpages()
- * Sum pages in active regions for movable zone.
+ * Sum pages in active regions for virtual zones.
  * Populate N_MEMORY for calculating usable_nodes.
  */
 static unsigned long __init early_calculate_totalpages(void)
@@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void)
 }
 
 /*
- * This finds a zone that can be used for ZONE_MOVABLE pages. The
+ * This finds a physical zone that can be used for virtual zones. The
  * assumption is made that zones within a node are ordered in monotonic
  * increasing memory addresses so that the "highest" populated zone is used
  */
-static void __init find_usable_zone_for_movable(void)
+static void __init find_usable_zone(void)
 {
 	int zone_index;
-	for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
-		if (zone_index == ZONE_MOVABLE)
-			continue;
-
+	for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) {
 		if (arch_zone_highest_possible_pfn[zone_index] >
 				arch_zone_lowest_possible_pfn[zone_index])
 			break;
 	}
 
 	VM_BUG_ON(zone_index == -1);
-	movable_zone = zone_index;
+	virt_zone = zone_index;
+}
+
+static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn)
+{
+	int i, nid;
+	unsigned long node_avg, remaining;
+	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+	/* usable_startpfn is the lowest possible pfn virtual zones can be at */
+	unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone];
+
+restart:
+	/* Carve out memory as evenly as possible throughout nodes */
+	node_avg = occupied / usable_nodes;
+	for_each_node_state(nid, N_MEMORY) {
+		unsigned long start_pfn, end_pfn;
+
+		/*
+		 * Recalculate node_avg if the division per node now exceeds
+		 * what is necessary to satisfy the amount of memory to carve
+		 * out.
+		 */
+		if (occupied < node_avg)
+			node_avg = occupied / usable_nodes;
+
+		/*
+		 * As the map is walked, we track how much memory is usable
+		 * using remaining. When it is 0, the rest of the node is
+		 * usable.
+		 */
+		remaining = node_avg;
+
+		/* Go through each range of PFNs within this node */
+		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+			unsigned long size_pages;
+
+			start_pfn = max(start_pfn, zone_pfn[nid]);
+			if (start_pfn >= end_pfn)
+				continue;
+
+			/* Account for what is only usable when carving out */
+			if (start_pfn < usable_startpfn) {
+				unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn;
+
+				remaining -= min(nr_pages, remaining);
+				occupied -= min(nr_pages, occupied);
+
+				/* Continue if range is now fully accounted */
+				if (end_pfn <= usable_startpfn) {
+
+					/*
+					 * Push zone_pfn to the end so that if
+					 * we have to carve out more across
+					 * nodes, we will not double account
+					 * here.
+					 */
+					zone_pfn[nid] = end_pfn;
+					continue;
+				}
+				start_pfn = usable_startpfn;
+			}
+
+			/*
+			 * The usable PFN range is from start_pfn->end_pfn.
+			 * Calculate size_pages as the number of pages used.
+			 */
+			size_pages = end_pfn - start_pfn;
+			if (size_pages > remaining)
+				size_pages = remaining;
+			zone_pfn[nid] = start_pfn + size_pages;
+
+			/*
+			 * Some memory was carved out, update counts and break
+			 * if the request for this node has been satisfied.
+			 */
+			occupied -= min(occupied, size_pages);
+			remaining -= size_pages;
+			if (!remaining)
+				break;
+		}
+	}
+
+	/*
+	 * If there is still more to carve out, we do another pass with one less
+	 * node in the count. This will push zone_pfn[nid] further along on the
+	 * nodes that still have memory until the request is fully satisfied.
+	 */
+	usable_nodes--;
+	if (usable_nodes && occupied > usable_nodes)
+		goto restart;
 }
 
 /*
@@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void)
  * memory. When they don't, some nodes will have more kernelcore than
  * others
  */
-static void __init find_zone_movable_pfns_for_nodes(void)
+static void __init find_virt_zones(void)
 {
-	int i, nid;
+	int i;
+	int nid;
 	unsigned long usable_startpfn;
-	unsigned long kernelcore_node, kernelcore_remaining;
 	/* save the state before borrow the nodemask */
 	nodemask_t saved_node_state = node_states[N_MEMORY];
 	unsigned long totalpages = early_calculate_totalpages();
-	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
 	struct memblock_region *r;
+	unsigned long occupied = 0;
 
-	/* Need to find movable_zone earlier when movable_node is specified. */
-	find_usable_zone_for_movable();
+	/* Need to find virt_zone earlier when movable_node is specified. */
+	find_usable_zone();
 
 	/*
 	 * If movable_node is specified, ignore kernelcore and movablecore
@@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 			nid = memblock_get_region_node(r);
 
 			usable_startpfn = PFN_DOWN(r->base);
-			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
-				min(usable_startpfn, zone_movable_pfn[nid]) :
+			pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
+				min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
 				usable_startpfn;
 		}
 
@@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 				continue;
 			}
 
-			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
-				min(usable_startpfn, zone_movable_pfn[nid]) :
+			pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
+				min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
 				usable_startpfn;
 		}
 
@@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		goto out2;
 	}
 
+	if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) {
+		nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0;
+		percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0;
+		zone_nosplit_order = zone_nomerge_order = 0;
+		pr_warn("zone %s order %d must be higher than zone %s order %d\n",
+			zone_names[ZONE_NOMERGE], zone_nomerge_order,
+			zone_names[ZONE_NOSPLIT], zone_nosplit_order);
+	}
+
 	/*
 	 * If kernelcore=nn% or movablecore=nn% was specified, calculate the
 	 * amount of necessary memory.
 	 */
-	if (required_kernelcore_percent)
-		required_kernelcore = (totalpages * 100 * required_kernelcore_percent) /
-				       10000UL;
-	if (required_movablecore_percent)
-		required_movablecore = (totalpages * 100 * required_movablecore_percent) /
-					10000UL;
+	for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) {
+		if (percentage_of(i))
+			nr_pages_of(i) = totalpages * percentage_of(i) / 100;
+
+		nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES);
+		occupied += nr_pages_of(i);
+	}
 
 	/*
 	 * If movablecore= was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
 	 * and movablecore are specified, then the value of kernelcore
-	 * will be used for required_kernelcore if it's greater than
-	 * what movablecore would have allowed.
+	 * will be used if it's greater than what movablecore would have
+	 * allowed.
 	 */
-	if (required_movablecore) {
-		unsigned long corepages;
+	if (occupied < totalpages) {
+		enum zone_type zid;
 
-		/*
-		 * Round-up so that ZONE_MOVABLE is at least as large as what
-		 * was requested by the user
-		 */
-		required_movablecore =
-			roundup(required_movablecore, MAX_ORDER_NR_PAGES);
-		required_movablecore = min(totalpages, required_movablecore);
-		corepages = totalpages - required_movablecore;
-
-		required_kernelcore = max(required_kernelcore, corepages);
+		zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ?
+		      LAST_PHYS_ZONE : ZONE_MOVABLE;
+		nr_pages_of(zid) += totalpages - occupied;
 	}
 
 	/*
 	 * If kernelcore was not specified or kernelcore size is larger
-	 * than totalpages, there is no ZONE_MOVABLE.
+	 * than totalpages, there are no virtual zones.
 	 */
-	if (!required_kernelcore || required_kernelcore >= totalpages)
+	occupied = nr_pages_of(LAST_PHYS_ZONE);
+	if (!occupied || occupied >= totalpages)
 		goto out;
 
-	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
+	for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
+		if (!nr_pages_of(i))
+			continue;
 
-restart:
-	/* Spread kernelcore memory as evenly as possible throughout nodes */
-	kernelcore_node = required_kernelcore / usable_nodes;
-	for_each_node_state(nid, N_MEMORY) {
-		unsigned long start_pfn, end_pfn;
-
-		/*
-		 * Recalculate kernelcore_node if the division per node
-		 * now exceeds what is necessary to satisfy the requested
-		 * amount of memory for the kernel
-		 */
-		if (required_kernelcore < kernelcore_node)
-			kernelcore_node = required_kernelcore / usable_nodes;
-
-		/*
-		 * As the map is walked, we track how much memory is usable
-		 * by the kernel using kernelcore_remaining. When it is
-		 * 0, the rest of the node is usable by ZONE_MOVABLE
-		 */
-		kernelcore_remaining = kernelcore_node;
-
-		/* Go through each range of PFNs within this node */
-		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
-			unsigned long size_pages;
-
-			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
-			if (start_pfn >= end_pfn)
-				continue;
-
-			/* Account for what is only usable for kernelcore */
-			if (start_pfn < usable_startpfn) {
-				unsigned long kernel_pages;
-				kernel_pages = min(end_pfn, usable_startpfn)
-								- start_pfn;
-
-				kernelcore_remaining -= min(kernel_pages,
-							kernelcore_remaining);
-				required_kernelcore -= min(kernel_pages,
-							required_kernelcore);
-
-				/* Continue if range is now fully accounted */
-				if (end_pfn <= usable_startpfn) {
-
-					/*
-					 * Push zone_movable_pfn to the end so
-					 * that if we have to rebalance
-					 * kernelcore across nodes, we will
-					 * not double account here
-					 */
-					zone_movable_pfn[nid] = end_pfn;
-					continue;
-				}
-				start_pfn = usable_startpfn;
-			}
-
-			/*
-			 * The usable PFN range for ZONE_MOVABLE is from
-			 * start_pfn->end_pfn. Calculate size_pages as the
-			 * number of pages used as kernelcore
-			 */
-			size_pages = end_pfn - start_pfn;
-			if (size_pages > kernelcore_remaining)
-				size_pages = kernelcore_remaining;
-			zone_movable_pfn[nid] = start_pfn + size_pages;
-
-			/*
-			 * Some kernelcore has been met, update counts and
-			 * break if the kernelcore for this node has been
-			 * satisfied
-			 */
-			required_kernelcore -= min(required_kernelcore,
-								size_pages);
-			kernelcore_remaining -= size_pages;
-			if (!kernelcore_remaining)
-				break;
-		}
+		find_virt_zone(occupied, &pfn_of(i, 0));
+		occupied += nr_pages_of(i);
 	}
-
-	/*
-	 * If there is still required_kernelcore, we do another pass with one
-	 * less node in the count. This will push zone_movable_pfn[nid] further
-	 * along on the nodes that still have memory until kernelcore is
-	 * satisfied
-	 */
-	usable_nodes--;
-	if (usable_nodes && required_kernelcore > usable_nodes)
-		goto restart;
-
 out2:
-	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
+	/* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++) {
 		unsigned long start_pfn, end_pfn;
-
-		zone_movable_pfn[nid] =
-			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
+		unsigned long prev_virt_zone_pfn = 0;
 
 		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
-		if (zone_movable_pfn[nid] >= end_pfn)
-			zone_movable_pfn[nid] = 0;
+
+		for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
+			pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES);
+
+			if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn)
+				pfn_of(i, nid) = 0;
+
+			if (pfn_of(i, nid))
+				prev_virt_zone_pfn = pfn_of(i, nid);
+		}
 	}
-
 out:
 	/* restore the node_state */
 	node_states[N_MEMORY] = saved_node_state;
@@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone,
 #endif
 
 /*
- * The zone ranges provided by the architecture do not include ZONE_MOVABLE
- * because it is sized independent of architecture. Unlike the other zones,
- * the starting point for ZONE_MOVABLE is not fixed. It may be different
- * in each node depending on the size of each node and how evenly kernelcore
- * is distributed. This helper function adjusts the zone ranges
+ * The zone ranges provided by the architecture do not include virtual zones
+ * because they are sized independent of architecture. Unlike physical zones,
+ * the starting point for the first populated virtual zone is not fixed. It may
+ * be different in each node depending on the size of each node and how evenly
+ * kernelcore is distributed. This helper function adjusts the zone ranges
  * provided by the architecture for a given node by using the end of the
- * highest usable zone for ZONE_MOVABLE. This preserves the assumption that
- * zones within a node are in order of monotonic increases memory addresses
+ * highest usable zone for the first populated virtual zone. This preserves the
+ * assumption that zones within a node are in order of monotonically
+ * increasing memory addresses.
  */
-static void __init adjust_zone_range_for_zone_movable(int nid,
+static void __init adjust_zone_range(int nid,
 					unsigned long zone_type,
 					unsigned long node_end_pfn,
 					unsigned long *zone_start_pfn,
 					unsigned long *zone_end_pfn)
 {
-	/* Only adjust if ZONE_MOVABLE is on this node */
-	if (zone_movable_pfn[nid]) {
-		/* Size ZONE_MOVABLE */
-		if (zone_type == ZONE_MOVABLE) {
-			*zone_start_pfn = zone_movable_pfn[nid];
-			*zone_end_pfn = min(node_end_pfn,
-				arch_zone_highest_possible_pfn[movable_zone]);
+	int i = max_t(int, zone_type, LAST_PHYS_ZONE);
+	unsigned long next_virt_zone_pfn = 0;
 
-		/* Adjust for ZONE_MOVABLE starting within this range */
-		} else if (!mirrored_kernelcore &&
-			*zone_start_pfn < zone_movable_pfn[nid] &&
-			*zone_end_pfn > zone_movable_pfn[nid]) {
-			*zone_end_pfn = zone_movable_pfn[nid];
+	while (i++ < LAST_VIRT_ZONE) {
+		if (pfn_of(i, nid)) {
+			next_virt_zone_pfn = pfn_of(i, nid);
+			break;
+		}
+	}
 
-		/* Check if this whole range is within ZONE_MOVABLE */
-		} else if (*zone_start_pfn >= zone_movable_pfn[nid])
+	if (zone_type <= LAST_PHYS_ZONE) {
+		if (!next_virt_zone_pfn)
+			return;
+
+		if (!mirrored_kernelcore &&
+		    *zone_start_pfn < next_virt_zone_pfn &&
+		    *zone_end_pfn > next_virt_zone_pfn)
+			*zone_end_pfn = next_virt_zone_pfn;
+		else if (*zone_start_pfn >= next_virt_zone_pfn)
 			*zone_start_pfn = *zone_end_pfn;
+	} else if (zone_type <= LAST_VIRT_ZONE) {
+		if (!pfn_of(zone_type, nid))
+			return;
+
+		if (next_virt_zone_pfn)
+			*zone_end_pfn = min3(next_virt_zone_pfn,
+					     node_end_pfn,
+					     arch_zone_highest_possible_pfn[virt_zone]);
+		else
+			*zone_end_pfn = min(node_end_pfn,
+					    arch_zone_highest_possible_pfn[virt_zone]);
+		*zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid));
 	}
 }
 
@@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
 	 * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
 	 * and vice versa.
 	 */
-	if (mirrored_kernelcore && zone_movable_pfn[nid]) {
+	if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) {
 		unsigned long start_pfn, end_pfn;
 		struct memblock_region *r;
 
@@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
 	/* Get the start and end of the zone */
 	*zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
 	*zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
-	adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn,
-					   zone_start_pfn, zone_end_pfn);
+	adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn);
 
 	/* Check that this node has pages within the zone's required range */
 	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
@@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 #if defined(CONFIG_MEMORY_HOTPLUG)
 		zone->present_early_pages = real_size;
 #endif
+		if (i == ZONE_NOSPLIT)
+			zone->order = zone_nosplit_order;
+		if (i == ZONE_NOMERGE)
+			zone->order = zone_nomerge_order;
 
 		totalpages += spanned;
 		realtotalpages += real_size;
@@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat)
 {
 	enum zone_type zone_type;
 
-	for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
+	for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) {
 		struct zone *zone = &pgdat->node_zones[zone_type];
 		if (populated_zone(zone)) {
 			if (IS_ENABLED(CONFIG_HIGHMEM))
@@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void)
 void __init free_area_init(unsigned long *max_zone_pfn)
 {
 	unsigned long start_pfn, end_pfn;
-	int i, nid, zone;
+	int i, j, nid, zone;
 	bool descending;
 
 	/* Record where the zone boundaries are */
@@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	start_pfn = PHYS_PFN(memblock_start_of_DRAM());
 	descending = arch_has_descending_max_zone_pfns();
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
+	for (i = 0; i <= LAST_PHYS_ZONE; i++) {
 		if (descending)
-			zone = MAX_NR_ZONES - i - 1;
+			zone = LAST_PHYS_ZONE - i;
 		else
 			zone = i;
 
-		if (zone == ZONE_MOVABLE)
-			continue;
-
 		end_pfn = max(max_zone_pfn[zone], start_pfn);
 		arch_zone_lowest_possible_pfn[zone] = start_pfn;
 		arch_zone_highest_possible_pfn[zone] = end_pfn;
@@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 		start_pfn = end_pfn;
 	}
 
-	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
-	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
-	find_zone_movable_pfns_for_nodes();
+	/* Find the PFNs that virtual zones begin at in each node */
+	find_virt_zones();
 
 	/* Print out the zone ranges */
 	pr_info("Zone ranges:\n");
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		if (i == ZONE_MOVABLE)
-			continue;
+	for (i = 0; i <= LAST_PHYS_ZONE; i++) {
 		pr_info("  %-8s ", zone_names[i]);
 		if (arch_zone_lowest_possible_pfn[i] ==
 				arch_zone_highest_possible_pfn[i])
@@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 					<< PAGE_SHIFT) - 1);
 	}
 
-	/* Print out the PFNs ZONE_MOVABLE begins at in each node */
-	pr_info("Movable zone start for each node\n");
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (zone_movable_pfn[i])
-			pr_info("  Node %d: %#018Lx\n", i,
-			       (u64)zone_movable_pfn[i] << PAGE_SHIFT);
+	/* Print out the PFNs virtual zones begin at in each node */
+	for (; i <= LAST_VIRT_ZONE; i++) {
+		pr_info("%s zone start for each node\n", zone_names[i]);
+		for (j = 0; j < MAX_NUMNODES; j++) {
+			if (pfn_of(i, j))
+				pr_info("  Node %d: %#018Lx\n",
+					j, (u64)pfn_of(i, j) << PAGE_SHIFT);
+		}
 	}
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..6a4da8f8691c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = {
 	 "HighMem",
 #endif
 	 "Movable",
+	 "NoSplit",
+	 "NoMerge",
 #ifdef CONFIG_ZONE_DEVICE
 	 "Device",
 #endif
@@ -290,9 +292,9 @@ int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
 
-/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
-int movable_zone;
-EXPORT_SYMBOL(movable_zone);
+/* virt_zone is the "real" zone pages in virtual zones are taken from */
+int virt_zone;
+EXPORT_SYMBOL(virt_zone);
 
 #if MAX_NUMNODES > 1
 unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
@@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 	unsigned long higher_page_pfn;
 	struct page *higher_page;
 
-	if (order >= MAX_PAGE_ORDER - 1)
-		return false;
-
 	higher_page_pfn = buddy_pfn & pfn;
 	higher_page = page + (higher_page_pfn - pfn);
 
@@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 			NULL) != NULL;
 }
 
+static int zone_max_order(struct zone *zone)
+{
+	return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER;
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	int max_order = zone_max_order(zone);
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
-	while (order < MAX_PAGE_ORDER) {
+	while (order < max_order) {
 		if (compaction_capture(capc, page, order, migratetype)) {
 			__mod_zone_freepage_state(zone, -(1 << order),
 								migratetype);
@@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page,
 		to_tail = true;
 	else if (is_shuffle_order(order))
 		to_tail = shuffle_pick_tail();
+	else if (order + 1 >= max_order)
+		to_tail = false;
 	else
 		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
@@ -866,6 +873,8 @@ int split_free_page(struct page *free_page,
 	int mt;
 	int ret = 0;
 
+	VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page);
+
 	if (split_pfn_offset == 0)
 		return ret;
 
@@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	struct free_area *area;
 	struct page *page;
 
+	VM_WARN_ON_ONCE(!zone_is_suitable(zone, order));
+
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
 		area = &(zone->free_area[current_order]);
@@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	long min = mark;
 	int o;
 
+	if (!zone_is_suitable(z, order))
+		return false;
+
 	/* free_pages may go negative - that's OK */
 	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
@@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 {
 	long free_pages;
 
+	if (!zone_is_suitable(z, order))
+		return false;
+
 	free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	/*
@@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		struct page *page;
 		unsigned long mark;
 
+		if (!zone_is_suitable(zone, order))
+			continue;
+
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
 			!__cpuset_zone_allowed(zone, gfp_mask))
@@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void)
 	struct zone *zone;
 	unsigned long flags;
 
-	/* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
+	/* Calculate total number of pages below ZONE_HIGHMEM */
 	for_each_zone(zone) {
-		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
+		if (zone_idx(zone) <= ZONE_NORMAL)
 			lowmem_pages += zone_managed_pages(zone);
 	}
 
@@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone_managed_pages(zone);
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
+		if (zone_idx(zone) > ZONE_NORMAL) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
-			 * need highmem and movable zones pages, so cap pages_min
-			 * to a small  value here.
+			 * need pages from zones above ZONE_NORMAL, so cap
+			 * pages_min to a small value here.
 			 *
 			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas control async page reclaim, and so should
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index cd0ea3668253..8a6473543427 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 		 * pages then it should be reasonably safe to assume the rest
 		 * is movable.
 		 */
-		if (zone_idx(zone) == ZONE_MOVABLE)
+		if (zid_is_virt(zone_idx(zone)))
 			continue;
 
 		/*
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..ad0db0373b05 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;
 
 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() &&
+		    folio_test_pmd_mappable(folio))
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..ae061ec4866a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					goto keep_locked;
 				if (folio_maybe_dma_pinned(folio))
 					goto keep_locked;
-				if (folio_test_large(folio)) {
-					/* cannot split folio, skip it */
-					if (!can_split_folio(folio, NULL))
-						goto activate_locked;
-					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
-					 */
-					if (!folio_entire_mapcount(folio) &&
-					    split_folio_to_list(folio,
-								folio_list))
-						goto activate_locked;
-				}
+				/*
+				 * Split folios that are not fully mapped right
+				 * away. Chances are some of the tail pages can
+				 * be freed without IO.
+				 */
+				if (folio_test_large(folio) &&
+				    atomic_read(&folio->_nr_pages_mapped) < nr_pages)
+					split_folio_to_list(folio, folio_list);
 				if (!add_to_swap(folio)) {
 					if (!folio_test_large(folio))
 						goto activate_locked_split;
@@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	orig_mask = sc->gfp_mask;
 	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
-		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
+		sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order);
 	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = current_gfp_context(gfp_mask),
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = gfp_order_zone(gfp_mask, order),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 	if (!cpuset_zone_allowed(zone, gfp_flags))
 		return;
 
+	curr_idx = gfp_order_zone(gfp_flags, order);
+	if (highest_zoneidx > curr_idx)
+		highest_zoneidx = curr_idx;
+
 	pgdat = zone->zone_pgdat;
 	curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);
 
@@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = gfp_order_zone(gfp_mask, order),
 	};
 	unsigned long pflags;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..adbd032e6a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 
 #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
 					TEXT_FOR_HIGHMEM(xx) xx "_movable", \
+					xx "_nosplit", xx "_nomerge", \
 					TEXT_FOR_DEVICE(xx)
 
 const char * const vmstat_text[] = {
@@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu"
-		   "\n        cma      %lu",
+		   "\n        cma      %lu"
+		   "\n        order    %u",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   zone->watermark_boost,
 		   min_wmark_pages(zone),
@@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone_managed_pages(zone),
-		   zone_cma_pages(zone));
+		   zone_cma_pages(zone),
+		   zone->order);
 
 	seq_printf(m,
 		   "\n        protection: (%ld",
-- 
2.44.0.rc1.240.g4c46232300-goog




* [Chapter Two] THP shattering: the reverse of collapsing
  2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
@ 2024-02-29 18:34 ` Yu Zhao
  2024-02-29 21:55   ` Zi Yan
  2024-02-29 18:34 ` [Chapter Three] THP HVO: bring the hugeTLB feature to THP Yu Zhao
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Yu Zhao @ 2024-02-29 18:34 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Jonathan Corbet, Yu Zhao

In contrast to splitting, shattering migrates the occupied pages of a
partially mapped THP to a set of base folios. IOW, unlike a split,
which is done in place, a shatter is the exact opposite of a collapse.

The advantage of shattering is that it keeps the original THP intact.
The cost of copying during the migration is not a side effect but by
design, since splitting is considered a discouraged behavior. In
retail terms, returning a purchase incurs a restocking fee, and the
original goods can be resold.

THPs from ZONE_NOMERGE can only be shattered, since they cannot be
split or merged. THPs from ZONE_NOSPLIT can be shattered or split (the
latter requires [1]), if they are above the minimum order.

[1] https://lore.kernel.org/20240226205534.1603748-1-zi.yan@sent.com/
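
As a sketch of the resulting caller-side decision (not part of the
patch; shatter_folio() is a hypothetical name standing in for the
entry points added in the mm/huge_memory.c hunks below):

	/*
	 * Sketch only: folio_can_split() comes from Chapter One; it is
	 * true for the physical zones and ZONE_MOVABLE, false for the
	 * THP zones.
	 */
	static int split_or_shatter(struct folio *folio, struct list_head *list)
	{
		if (folio_can_split(folio))
			/* Split in place, as before. */
			return split_folio_to_list(folio, list);

		/*
		 * Migrate the occupied subpages to freshly allocated base
		 * folios and keep the original THP intact.
		 */
		return shatter_folio(folio, list);
	}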

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/memcontrol.h    |   5 +
 include/linux/mm_inline.h     |  24 +++
 include/linux/mm_types.h      |   8 +-
 include/linux/vm_event_item.h |   3 +
 mm/huge_memory.c              | 303 ++++++++++++++++++++++++++++------
 mm/internal.h                 |  38 +++++
 mm/madvise.c                  |  11 +-
 mm/memcontrol.c               |  47 ++++++
 mm/memory-failure.c           |   2 +-
 mm/migrate.c                  |  44 +++--
 mm/page_alloc.c               |   4 +
 mm/rmap.c                     |   4 +
 mm/shmem.c                    |   4 +-
 mm/truncate.c                 |   6 +-
 mm/userfaultfd.c              |   2 +-
 mm/vmscan.c                   |   9 +
 mm/vmstat.c                   |   3 +
 17 files changed, 443 insertions(+), 74 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ff87f8e001..435cf114c6e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1163,6 +1163,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 }
 
 void split_page_memcg(struct page *head, unsigned int nr);
+void folio_copy_memcg(struct folio *folio);
 
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
@@ -1624,6 +1625,10 @@ static inline void split_page_memcg(struct page *head, unsigned int nr)
 {
 }
 
+static inline void folio_copy_memcg(struct folio *folio)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f4fe593c1400..aa96d6ed0223 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -218,6 +218,25 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 	VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
+static inline bool lru_gen_add_dst(struct lruvec *lruvec, struct folio *dst)
+{
+	int gen = folio_lru_gen(dst);
+	int type = folio_is_file_lru(dst);
+	int zone = folio_zonenum(dst);
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+	if (gen < 0)
+		return false;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+	VM_WARN_ON_ONCE_FOLIO(folio_lruvec(dst) != lruvec, dst);
+
+	list_add_tail(&dst->lru, &lrugen->folios[gen][type][zone]);
+	lru_gen_update_size(lruvec, dst, -1, gen);
+
+	return true;
+}
+
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	unsigned long seq;
@@ -303,6 +322,11 @@ static inline bool lru_gen_in_fault(void)
 	return false;
 }
 
+static inline bool lru_gen_add_dst(struct lruvec *lruvec, struct folio *dst)
+{
+	return false;
+}
+
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	return false;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..f483b273e80e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -323,14 +323,19 @@ struct folio {
 		struct {
 			unsigned long _flags_1;
 			unsigned long _head_1;
-			unsigned long _folio_avail;
 	/* public: */
 			atomic_t _entire_mapcount;
 			atomic_t _nr_pages_mapped;
 			atomic_t _pincount;
 #ifdef CONFIG_64BIT
+			unsigned int __padding;
 			unsigned int _folio_nr_pages;
 #endif
+			union {
+				unsigned long _private_1;
+				unsigned long *_dst_ul;
+				struct page **_dst_pp;
+			};
 	/* private: the union with struct page is transitional */
 		};
 		struct page __page_1;
@@ -382,6 +387,7 @@ FOLIO_MATCH(_last_cpupid, _last_cpupid);
 			offsetof(struct page, pg) + sizeof(struct page))
 FOLIO_MATCH(flags, _flags_1);
 FOLIO_MATCH(compound_head, _head_1);
+FOLIO_MATCH(private, _private_1);
 #undef FOLIO_MATCH
 #define FOLIO_MATCH(pg, fl)						\
 	static_assert(offsetof(struct folio, fl) ==			\
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9a54d15d5ec3..027851c795bc 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SPLIT_PAGE_FAILED,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_SHATTER_PAGE,
+		THP_SHATTER_PAGE_FAILED,
+		THP_SHATTER_PAGE_DISCARDED,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
 		THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b57faa0a1e83..62d2254bc51c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2586,6 +2586,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_swp_mksoft_dirty(entry);
 			if (uffd_wp)
 				entry = pte_swp_mkuffd_wp(entry);
+			if (vma->vm_flags & VM_LOCKED)
+				set_src_usage(page + i, SRC_PAGE_MLOCKED);
+			else
+				set_src_usage(page + i, SRC_PAGE_MAPPED);
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			if (write)
@@ -2732,6 +2736,178 @@ static void remap_page(struct folio *folio, unsigned long nr)
 	}
 }
 
+static int prep_to_unmap(struct folio *src)
+{
+	int nr_pages = folio_nr_pages(src);
+
+	if (folio_can_split(src))
+		return 0;
+
+	WARN_ON_ONCE(src->_dst_pp);
+
+	src->_dst_pp = kcalloc(nr_pages, sizeof(struct page *), GFP_ATOMIC);
+
+	return src->_dst_pp ? 0 : -ENOMEM;
+}
+
+static bool try_to_discard(struct folio *src, int i)
+{
+	int usage;
+	void *addr;
+	struct page *page = folio_page(src, i);
+
+	if (!folio_test_anon(src))
+		return false;
+
+	if (folio_test_swapcache(src))
+		return false;
+
+	usage = src_page_usage(page);
+	if (usage & SRC_PAGE_MLOCKED)
+		return false;
+
+	if (!(usage & SRC_PAGE_MAPPED))
+		return true;
+
+	addr = kmap_local_page(page);
+	if (!memchr_inv(addr, 0, PAGE_SIZE))
+		set_src_usage(page, SRC_PAGE_CLEAN);
+	kunmap_local(addr);
+
+	return can_discard_src(page);
+}
+
+static int prep_dst_pages(struct folio *src)
+{
+	int i;
+	int nr_pages = folio_nr_pages(src);
+
+	if (folio_can_split(src))
+		return 0;
+
+	if (WARN_ON_ONCE(!src->_dst_pp))
+		return -ENOMEM;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *dst = NULL;
+
+		if (try_to_discard(src, i)) {
+			count_vm_event(THP_SHATTER_PAGE_DISCARDED);
+			continue;
+		}
+
+		do {
+			int nid = folio_nid(src);
+			gfp_t gfp = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
+				    GFP_NOWAIT | __GFP_THISNODE;
+
+			if (dst)
+				__free_page(dst);
+
+			dst = alloc_pages_node(nid, gfp, 0);
+			if (!dst)
+				return -ENOMEM;
+		} while (!page_ref_freeze(dst, 1));
+
+		copy_highpage(dst, folio_page(src, i));
+		src->_dst_ul[i] |= (unsigned long)dst;
+
+		cond_resched();
+	}
+
+	return 0;
+}
+
+static void free_dst_pages(struct folio *src)
+{
+	int i;
+	int nr_pages = folio_nr_pages(src);
+
+	if (folio_can_split(src))
+		return;
+
+	if (!src->_dst_pp)
+		return;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *dst = folio_dst_page(src, i);
+
+		if (!dst)
+			continue;
+
+		page_ref_unfreeze(dst, 1);
+		__free_page(dst);
+	}
+
+	kfree(src->_dst_pp);
+	src->_dst_pp = NULL;
+}
+
+static void reset_src_folio(struct folio *src)
+{
+	if (folio_can_split(src))
+		return;
+
+	if (WARN_ON_ONCE(!src->_dst_pp))
+		return;
+
+	if (!folio_mapping_flags(src))
+		src->mapping = NULL;
+
+	if (folio_test_anon(src) && folio_test_swapcache(src)) {
+		folio_clear_swapcache(src);
+		src->swap.val = 0;
+	}
+
+	kfree(src->_dst_pp);
+	src->_dst_pp = NULL;
+}
+
+static void copy_page_owner(struct folio *src)
+{
+	int i;
+	int nr_pages = folio_nr_pages(src);
+
+	if (folio_can_split(src))
+		return;
+
+	if (WARN_ON_ONCE(!src->_dst_pp))
+		return;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *dst = folio_dst_page(src, i);
+
+		if (dst)
+			folio_copy_owner(src, page_folio(dst));
+	}
+}
+
+static bool lru_add_dst(struct lruvec *lruvec, struct folio *src, struct folio *dst)
+{
+	if (folio_can_split(src))
+		return false;
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_lru(src), src);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(dst), dst);
+	VM_WARN_ON_ONCE_FOLIO(folio_lruvec(dst) != folio_lruvec(src), dst);
+
+	if (!lru_gen_add_dst(lruvec, dst)) {
+		enum lru_list lru = folio_lru_list(dst);
+		int zone = folio_zonenum(dst);
+		int delta = folio_nr_pages(dst);
+
+		if (folio_test_unevictable(dst))
+			dst->mlock_count = 0;
+		else
+			list_add_tail(&dst->lru, &src->lru);
+		update_lru_size(lruvec, lru, zone, delta);
+	}
+
+	folio_set_lru(dst);
+
+	return true;
+}
+
 static void lru_add_page_tail(struct page *head, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
@@ -2745,7 +2921,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 		VM_WARN_ON(PageLRU(head));
 		get_page(tail);
 		list_add_tail(&tail->lru, list);
-	} else {
+	} else if (!lru_add_dst(lruvec, page_folio(head), page_folio(tail))) {
 		/* head is still on lru (and we have it frozen) */
 		VM_WARN_ON(!PageLRU(head));
 		if (PageUnevictable(tail))
@@ -2760,7 +2936,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
 	struct page *head = &folio->page;
-	struct page *page_tail = head + tail;
+	struct page *page_tail = folio_dst_page(folio, tail);
 	/*
 	 * Careful: new_folio is not a "real" folio before we cleared PageTail.
 	 * Don't pass it around before clear_compound_head().
@@ -2801,8 +2977,8 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
 			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first and second tail page is replaced by other uses */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
+	VM_BUG_ON_PAGE(folio_can_split(folio) && tail > 2 &&
+		       page_tail->mapping != TAIL_MAPPING, page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
 
@@ -2857,9 +3033,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
 	int i, nr_dropped = 0;
+	bool can_split = folio_can_split(folio);
 
 	/* complete memcg works before add pages to LRU */
-	split_page_memcg(head, nr);
+	if (can_split)
+		split_page_memcg(head, nr);
+	else
+		folio_copy_memcg(folio);
 
 	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
 		offset = swp_offset(folio->swap);
@@ -2872,46 +3052,53 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	ClearPageHasHWPoisoned(head);
 
-	for (i = nr - 1; i >= 1; i--) {
+	for (i = nr - 1; i >= can_split; i--) {
+		struct page *dst = folio_dst_page(folio, i);
+
+		if (!dst)
+			continue;
+
 		__split_huge_page_tail(folio, i, lruvec, list);
 		/* Some pages can be beyond EOF: drop them from page cache */
-		if (head[i].index >= end) {
-			struct folio *tail = page_folio(head + i);
+		if (dst->index >= end) {
+			struct folio *tail = page_folio(dst);
 
-			if (shmem_mapping(head->mapping))
+			if (shmem_mapping(tail->mapping))
 				nr_dropped++;
 			else if (folio_test_clear_dirty(tail))
 				folio_account_cleaned(tail,
-					inode_to_wb(folio->mapping->host));
+					inode_to_wb(tail->mapping->host));
 			__filemap_remove_folio(tail, NULL);
 			folio_put(tail);
-		} else if (!PageAnon(page)) {
-			__xa_store(&head->mapping->i_pages, head[i].index,
-					head + i, 0);
+		} else if (!PageAnon(dst)) {
+			__xa_store(&dst->mapping->i_pages, dst->index, dst, 0);
 		} else if (swap_cache) {
-			__xa_store(&swap_cache->i_pages, offset + i,
-					head + i, 0);
+			__xa_store(&swap_cache->i_pages, offset + i, dst, 0);
 		}
 	}
 
-	ClearPageCompound(head);
+	if (can_split)
+		ClearPageCompound(head);
 	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
-	split_page_owner(head, nr);
+	if (can_split)
+		split_page_owner(head, nr);
+	else
+		copy_page_owner(folio);
 
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
 		/* Additional pin to swap cache */
 		if (PageSwapCache(head)) {
-			page_ref_add(head, 2);
+			page_ref_add(head, 2 - !can_split);
 			xa_unlock(&swap_cache->i_pages);
 		} else {
 			page_ref_inc(head);
 		}
 	} else {
 		/* Additional pin to page cache */
-		page_ref_add(head, 2);
+		page_ref_add(head, 2 - !can_split);
 		xa_unlock(&head->mapping->i_pages);
 	}
 	local_irq_enable();
@@ -2924,8 +3111,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		split_swap_cluster(folio->swap);
 
 	for (i = 0; i < nr; i++) {
-		struct page *subpage = head + i;
-		if (subpage == page)
+		struct page *subpage = folio_dst_page(folio, i);
+
+		if (!subpage || subpage == page)
 			continue;
 		unlock_page(subpage);
 
@@ -2945,9 +3133,6 @@ static bool can_split_folio(struct folio *folio, int *pextra_pins)
 {
 	int extra_pins;
 
-	if (!folio_can_split(folio))
-		return false;
-
 	/* Additional pins from page cache */
 	if (folio_test_anon(folio))
 		extra_pins = folio_test_swapcache(folio) ?
@@ -3067,8 +3252,21 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		goto out_unlock;
 	}
 
+	ret = prep_to_unmap(folio);
+	if (ret)
+		goto out_unlock;
+
 	unmap_folio(folio);
 
+	if (!folio_ref_freeze(folio, 1 + extra_pins)) {
+		ret = -EAGAIN;
+		goto fail;
+	}
+
+	ret = prep_dst_pages(folio);
+	if (ret)
+		goto fail;
+
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
 	if (mapping) {
@@ -3078,44 +3276,41 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		 */
 		xas_lock(&xas);
 		xas_reset(&xas);
-		if (xas_load(&xas) != folio)
+		if (xas_load(&xas) != folio) {
+			xas_unlock(&xas);
+			local_irq_enable();
+			ret = -EAGAIN;
 			goto fail;
+		}
 	}
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
-	if (folio_ref_freeze(folio, 1 + extra_pins)) {
-		if (!list_empty(&folio->_deferred_list)) {
-			ds_queue->split_queue_len--;
-			list_del(&folio->_deferred_list);
-		}
-		spin_unlock(&ds_queue->split_queue_lock);
-		if (mapping) {
-			int nr = folio_nr_pages(folio);
+	if (!list_empty(&folio->_deferred_list)) {
+		ds_queue->split_queue_len--;
+		list_del_init(&folio->_deferred_list);
+	}
+	spin_unlock(&ds_queue->split_queue_lock);
+	if (mapping) {
+		int nr = folio_nr_pages(folio);
 
-			xas_split(&xas, folio, folio_order(folio));
-			if (folio_test_pmd_mappable(folio)) {
-				if (folio_test_swapbacked(folio)) {
-					__lruvec_stat_mod_folio(folio,
-							NR_SHMEM_THPS, -nr);
-				} else {
-					__lruvec_stat_mod_folio(folio,
-							NR_FILE_THPS, -nr);
-					filemap_nr_thps_dec(mapping);
-				}
+		xas_split(&xas, folio, folio_order(folio));
+		if (folio_test_pmd_mappable(folio)) {
+			if (folio_test_swapbacked(folio)) {
+				__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
+			} else {
+				__lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
+				filemap_nr_thps_dec(mapping);
 			}
 		}
+	}
 
-		__split_huge_page(page, list, end);
-		ret = 0;
-	} else {
-		spin_unlock(&ds_queue->split_queue_lock);
+	__split_huge_page(page, list, end);
+	reset_src_folio(folio);
 fail:
-		if (mapping)
-			xas_unlock(&xas);
-		local_irq_enable();
+	if (ret) {
+		free_dst_pages(folio);
 		remap_page(folio, folio_nr_pages(folio));
-		ret = -EAGAIN;
 	}
 
 out_unlock:
@@ -3127,6 +3322,12 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		i_mmap_unlock_read(mapping);
 out:
 	xas_destroy(&xas);
+
+	if (!folio_can_split(folio)) {
+		count_vm_event(!ret ? THP_SHATTER_PAGE : THP_SHATTER_PAGE_FAILED);
+		return ret ? : 1;
+	}
+
 	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
 }
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..ac1d27468899 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1266,4 +1266,42 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
 }
 #endif /* CONFIG_SHRINKER_DEBUG */
 
+#define SRC_PAGE_MAPPED		BIT(0)
+#define SRC_PAGE_MLOCKED	BIT(1)
+#define SRC_PAGE_CLEAN		BIT(2)
+#define SRC_PAGE_USAGE_MASK	(BIT(3) - 1)
+
+static inline unsigned long src_page_usage(struct page *page)
+{
+	struct folio *src = page_folio(page);
+	int i = folio_page_idx(src, page);
+
+	if (folio_can_split(src) || !src->_dst_ul)
+		return 0;
+
+	return src->_dst_ul[i] & SRC_PAGE_USAGE_MASK;
+}
+
+static inline bool can_discard_src(struct page *page)
+{
+	return src_page_usage(page) & SRC_PAGE_CLEAN;
+}
+
+static inline void set_src_usage(struct page *page, unsigned long usage)
+{
+	struct folio *src = page_folio(page);
+	int i = folio_page_idx(src, page);
+
+	if (!folio_can_split(src) && src->_dst_ul)
+		src->_dst_ul[i] |= usage;
+}
+
+static inline struct page *folio_dst_page(struct folio *src, int i)
+{
+	if (folio_can_split(src) || !src->_dst_ul)
+		return folio_page(src, i);
+
+	return (void *)(src->_dst_ul[i] & ~SRC_PAGE_USAGE_MASK);
+}
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index cfa5e7288261..0f82e132cd52 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -381,7 +381,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (!err)
+			if (err >= 0)
 				goto regular_folio;
 			return 0;
 		}
@@ -466,8 +466,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (err)
+			if (err < 0)
 				break;
+			if (err)
+				goto restart;
 			start_pte = pte =
 				pte_offset_map_lock(mm, pmd, addr, &ptl);
 			if (!start_pte)
@@ -635,6 +637,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			return 0;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
+restart:
 	start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!start_pte)
 		return 0;
@@ -688,8 +691,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (err)
+			if (err < 0)
 				break;
+			if (err)
+				goto restart;
 			start_pte = pte =
 				pte_offset_map_lock(mm, pmd, addr, &ptl);
 			if (!start_pte)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61932c9215e7..71b4d1e610db 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3624,6 +3624,53 @@ void split_page_memcg(struct page *head, unsigned int nr)
 		css_get_many(&memcg->css, nr - 1);
 }
 
+void folio_copy_memcg(struct folio *src)
+{
+	int i;
+	unsigned long flags;
+	int delta = 0;
+	int nr_pages = folio_nr_pages(src);
+	struct mem_cgroup *memcg = folio_memcg(src);
+
+	if (folio_can_split(src))
+		return;
+
+	if (WARN_ON_ONCE(!src->_dst_pp))
+		return;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	if (WARN_ON_ONCE(!memcg))
+		return;
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_large(src), src);
+	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(src), src);
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *dst = folio_dst_page(src, i);
+
+		if (!dst)
+			continue;
+
+		commit_charge(page_folio(dst), memcg);
+		delta++;
+	}
+
+	if (!mem_cgroup_is_root(memcg)) {
+		page_counter_charge(&memcg->memory, delta);
+		if (do_memsw_account())
+			page_counter_charge(&memcg->memsw, delta);
+	}
+
+	css_get_many(&memcg->css, delta);
+
+	local_irq_save(flags);
+	mem_cgroup_charge_statistics(memcg, delta);
+	memcg_check_events(memcg, folio_nid(src));
+	local_irq_restore(flags);
+}
+
 #ifdef CONFIG_SWAP
 /**
  * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9349948f1abf..b9d2f821ba63 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2289,7 +2289,7 @@ int memory_failure(unsigned long pfn, int flags)
 		 * page is a valid handlable page.
 		 */
 		SetPageHasHWPoisoned(hpage);
-		if (try_to_split_thp_page(p) < 0) {
+		if (try_to_split_thp_page(p)) {
 			res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
 			goto unlock_mutex;
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index f615c0c22046..610be0029efd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -180,36 +180,52 @@ void putback_movable_pages(struct list_head *l)
 /*
  * Restore a potential migration pte to a working pte entry
  */
-static bool remove_migration_pte(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long addr, void *old)
+static bool remove_migration_pte(struct folio *dst,
+		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	struct folio *src = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, src, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
 		pte_t old_pte;
 		pte_t pte;
 		swp_entry_t entry;
-		struct page *new;
+		struct page *page;
+		struct folio *folio;
 		unsigned long idx = 0;
 
 		/* pgoff is invalid for ksm pages, but they are never large */
-		if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+		if (folio_test_large(dst) && !folio_test_hugetlb(dst))
 			idx = linear_page_index(vma, pvmw.address) - pvmw.pgoff;
-		new = folio_page(folio, idx);
+		page = folio_page(dst, idx);
+
+		if (src == dst) {
+			if (can_discard_src(page)) {
+				VM_WARN_ON_ONCE_FOLIO(!folio_test_anon(src), src);
+
+				pte_clear_not_present_full(pvmw.vma->vm_mm, pvmw.address,
+							   pvmw.pte, false);
+				dec_mm_counter(pvmw.vma->vm_mm, MM_ANONPAGES);
+				continue;
+			}
+			page = folio_dst_page(src, idx);
+		}
+
+		folio = page_folio(page);
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
 			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
 					!folio_test_pmd_mappable(folio), folio);
-			remove_migration_pmd(&pvmw, new);
+			remove_migration_pmd(&pvmw, page);
 			continue;
 		}
 #endif
 
 		folio_get(folio);
-		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
+		pte = mk_pte(page, READ_ONCE(vma->vm_page_prot));
 		old_pte = ptep_get(pvmw.pte);
 		if (pte_swp_soft_dirty(old_pte))
 			pte = pte_mksoft_dirty(pte);
@@ -227,13 +243,13 @@ static bool remove_migration_pte(struct folio *folio,
 		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
 			rmap_flags |= RMAP_EXCLUSIVE;
 
-		if (unlikely(is_device_private_page(new))) {
+		if (unlikely(is_device_private_page(page))) {
 			if (pte_write(pte))
 				entry = make_writable_device_private_entry(
-							page_to_pfn(new));
+							page_to_pfn(page));
 			else
 				entry = make_readable_device_private_entry(
-							page_to_pfn(new));
+							page_to_pfn(page));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(old_pte))
 				pte = pte_swp_mksoft_dirty(pte);
@@ -259,17 +275,17 @@ static bool remove_migration_pte(struct folio *folio,
 #endif
 		{
 			if (folio_test_anon(folio))
-				folio_add_anon_rmap_pte(folio, new, vma,
+				folio_add_anon_rmap_pte(folio, page, vma,
 							pvmw.address, rmap_flags);
 			else
-				folio_add_file_rmap_pte(folio, new, vma);
+				folio_add_file_rmap_pte(folio, page, vma);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 
 		trace_remove_migration_pte(pvmw.address, pte_val(pte),
-					   compound_order(new));
+					   compound_order(page));
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6a4da8f8691c..dd843fb04f78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1012,6 +1012,10 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 			bad_page(page, "nonzero pincount");
 			goto out;
 		}
+		if (unlikely(folio->_private_1)) {
+			bad_page(page, "nonzero _private_1");
+			goto out;
+		}
 		break;
 	case 2:
 		/*
diff --git a/mm/rmap.c b/mm/rmap.c
index f5d43edad529..0ddb28c52961 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2260,6 +2260,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 						hsz);
 			else
 				set_pte_at(mm, address, pvmw.pte, swp_pte);
+			if (vma->vm_flags & VM_LOCKED)
+				set_src_usage(subpage, SRC_PAGE_MLOCKED);
+			else
+				set_src_usage(subpage, SRC_PAGE_MAPPED);
 			trace_set_migration_pte(address, pte_val(swp_pte),
 						compound_order(&folio->page));
 			/*
diff --git a/mm/shmem.c b/mm/shmem.c
index d7c84ff62186..8fa8056d3724 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -696,7 +696,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 		folio_put(folio);
 
 		/* If split failed move the inode on the list back to shrinklist */
-		if (ret)
+		if (ret < 0)
 			goto move_back;
 
 		split++;
@@ -1450,7 +1450,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (folio_test_large(folio)) {
 		/* Ensure the subpages are still dirty */
 		folio_test_set_dirty(folio);
-		if (split_huge_page(page) < 0)
+		if (split_huge_page(page))
 			goto redirty;
 		folio = page_folio(page);
 		folio_clear_dirty(folio);
diff --git a/mm/truncate.c b/mm/truncate.c
index 725b150e47ac..df0680cfe6a2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -209,6 +209,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
  */
 bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 {
+	int err;
 	loff_t pos = folio_pos(folio);
 	unsigned int offset, length;
 
@@ -239,8 +240,11 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		folio_invalidate(folio, offset, length);
 	if (!folio_test_large(folio))
 		return true;
-	if (split_folio(folio) == 0)
+	err = split_folio(folio);
+	if (!err)
 		return true;
+	if (err > 0)
+		return false;
 	if (folio_test_dirty(folio))
 		return false;
 	truncate_inode_folio(folio->mapping, folio);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7cf7d4384259..cf490b101cac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1094,7 +1094,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			pte_unmap(&orig_dst_pte);
 			src_pte = dst_pte = NULL;
 			err = split_folio(src_folio);
-			if (err)
+			if (err < 0)
 				goto out;
 			/* have to reacquire the folio after it got split */
 			folio_unlock(src_folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ae061ec4866a..d6c31421a3b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1223,6 +1223,15 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 				goto keep_locked;
 		}
 
+		if (folio_ref_count(folio) == 1) {
+			folio_unlock(folio);
+			if (folio_put_testzero(folio))
+				goto free_it;
+
+			nr_reclaimed += nr_pages;
+			continue;
+		}
+
 		/*
 		 * If the folio was split above, the tail pages will make
 		 * their own pass through this function and be accounted
diff --git a/mm/vmstat.c b/mm/vmstat.c
index adbd032e6a0f..ff2114452334 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1364,6 +1364,9 @@ const char * const vmstat_text[] = {
 	"thp_split_page_failed",
 	"thp_deferred_split_page",
 	"thp_split_pmd",
+	"thp_shatter_page",
+	"thp_shatter_page_failed",
+	"thp_shatter_page_discarded",
 	"thp_scan_exceed_none_pte",
 	"thp_scan_exceed_swap_pte",
 	"thp_scan_exceed_share_pte",
-- 
2.44.0.rc1.240.g4c46232300-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Chapter Three] THP HVO: bring the hugeTLB feature to THP
  2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
  2024-02-29 18:34 ` [Chapter Two] THP shattering: the reverse of collapsing Yu Zhao
@ 2024-02-29 18:34 ` Yu Zhao
  2024-02-29 22:54   ` Yang Shi
  2024-02-29 18:34 ` [Epilogue] Profile-Guided Heap Optimization and THP fungibility Yu Zhao
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Yu Zhao @ 2024-02-29 18:34 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Jonathan Corbet, Yu Zhao

HVO can be one of the perks for heavy THP users like it is for hugeTLB
users. For example, if such a user uses 60% of physical memory for 2MB
THPs, THP HVO can reduce the struct page overhead by half (60% * 7/8
~= 50%).

ZONE_NOMERGE considerably simplifies the implementation of HVO for
THPs, since THPs from it cannot be split or merged and thus do not
require any correctness-related operations on tail pages beyond the
second one.

If a THP is mapped by PTEs, two optimization-related pieces of tail
page state, i.e., _mapcount and PG_anon_exclusive, can be binned so
that each tracks a group of pages, e.g., eight pages per group for
2MB THPs. This estimation, like the copying cost incurred during
shattering, is by design, since mapping by PTEs is another
discouraged behavior.
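
To make the arithmetic above concrete, here is a standalone
back-of-the-envelope sketch; the 4KB base page size and 64-byte
struct page are assumptions matching common 64-bit configs, not
values defined by this series:

#include <stdio.h>

int main(void)
{
        const double page_size = 4096;     /* assumed 4KB base pages */
        const double struct_page = 64;     /* assumed 64B struct page */
        const double nr_pages = 512;       /* one 2MB THP */

        /* vmemmap footprint of one 2MB THP: 512 * 64B = 8 base pages */
        double vmemmap_pages = nr_pages * struct_page / page_size;

        /* HVO keeps the first vmemmap page and can free the other 7 */
        printf("vmemmap pages per 2MB THP: %.0f, freed by HVO: %.0f\n",
               vmemmap_pages, vmemmap_pages - 1);

        /*
         * The kept vmemmap page holds 4096 / 64 = 64 struct pages, so
         * PTE mapcounts are binned: base page i is tracked by struct
         * page i / 8, i.e., eight pages per group as stated above.
         */
        printf("pages per _mapcount bin: %.0f\n",
               nr_pages / (page_size / struct_page));

        /* 60% THP coverage: 0.60 * 7/8 of the struct page overhead */
        printf("struct page overhead saved: %.1f%%\n",
               100 * 0.60 * (vmemmap_pages - 1) / vmemmap_pages);

        return 0;
}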

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/mm.h     | 140 ++++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h |   1 +
 include/linux/rmap.h   |   4 ++
 init/main.c            |   1 +
 mm/gup.c               |   3 +-
 mm/huge_memory.c       |   2 +
 mm/hugetlb_vmemmap.c   |   2 +-
 mm/internal.h          |   9 ---
 mm/memory.c            |  11 +--
 mm/page_alloc.c        | 151 ++++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c              |  17 ++++-
 mm/vmstat.c            |   2 +
 12 files changed, 323 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..d7014fc35cca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1196,6 +1196,138 @@ static inline void page_mapcount_reset(struct page *page)
 	atomic_set(&(page)->_mapcount, -1);
 }
 
+#define HVO_MOD (PAGE_SIZE / sizeof(struct page))
+
+static inline int hvo_order_size(int order)
+{
+	if (PAGE_SIZE % sizeof(struct page) || !is_power_of_2(HVO_MOD))
+		return 0;
+
+	return (1 << order) * sizeof(struct page);
+}
+
+static inline bool page_hvo_suitable(struct page *head, int order)
+{
+	VM_WARN_ON_ONCE_PAGE(!test_bit(PG_head, &head->flags), head);
+
+	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
+		return false;
+
+	return page_zonenum(head) == ZONE_NOMERGE &&
+	       IS_ALIGNED((unsigned long)head, PAGE_SIZE) &&
+	       hvo_order_size(order) > PAGE_SIZE;
+}
+
+static inline bool folio_hvo_suitable(struct folio *folio)
+{
+	return folio_test_large(folio) && page_hvo_suitable(&folio->page, folio_order(folio));
+}
+
+static inline bool page_is_hvo(struct page *head, int order)
+{
+	return page_hvo_suitable(head, order) && test_bit(PG_head, &head[HVO_MOD].flags);
+}
+
+static inline bool folio_is_hvo(struct folio *folio)
+{
+	return folio_test_large(folio) && page_is_hvo(&folio->page, folio_order(folio));
+}
+
+/*
+ * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
+ * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
+ * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
+ * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
+ */
+#define ENTIRELY_MAPPED		0x800000
+#define FOLIO_PAGES_MAPPED	(ENTIRELY_MAPPED - 1)
+
+static inline int hvo_range_mapcount(struct folio *folio, struct page *page, int nr_pages, int *ret)
+{
+	int i, next, end;
+	int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
+
+	if (!folio_is_hvo(folio))
+		return false;
+
+	*ret = folio_entire_mapcount(folio);
+
+	for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
+		next = min(end, round_down(i + stride, stride));
+
+		page = folio_page(folio, i / stride);
+		*ret += atomic_read(&page->_mapcount) + 1;
+	}
+
+	return true;
+}
+
+static inline bool hvo_map_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
+{
+	int i, next, end;
+	int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
+
+	if (!folio_is_hvo(folio))
+		return false;
+
+	*ret = 0;
+
+	for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
+		next = min(end, round_down(i + stride, stride));
+
+		page = folio_page(folio, i / stride);
+		if (atomic_add_return(next - i, &page->_mapcount) == next - i - 1)
+			*ret += stride;
+	}
+
+	if (atomic_add_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
+		*ret = 0;
+
+	return true;
+}
+
+static inline bool hvo_unmap_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
+{
+	int i, next, end;
+	int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
+
+	if (!folio_is_hvo(folio))
+		return false;
+
+	*ret = 0;
+
+	for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
+		next = min(end, round_down(i + stride, stride));
+
+		page = folio_page(folio, i / stride);
+		if (atomic_sub_return(next - i, &page->_mapcount) == -1)
+			*ret += stride;
+	}
+
+	if (atomic_sub_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
+		*ret = 0;
+
+	return true;
+}
+
+static inline bool hvo_dup_range(struct folio *folio, struct page *page, int nr_pages)
+{
+	int i, next, end;
+	int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
+
+	if (!folio_is_hvo(folio))
+		return false;
+
+	for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
+		next = min(end, round_down(i + stride, stride));
+
+		page = folio_page(folio, i / stride);
+		atomic_add(next - i, &page->_mapcount);
+	}
+
+	return true;
+}
+
 /**
  * page_mapcount() - Number of times this precise page is mapped.
  * @page: The page.
@@ -1212,6 +1344,9 @@ static inline int page_mapcount(struct page *page)
 {
 	int mapcount = atomic_read(&page->_mapcount) + 1;
 
+	if (hvo_range_mapcount(page_folio(page), page, 1, &mapcount))
+		return mapcount;
+
 	if (unlikely(PageCompound(page)))
 		mapcount += folio_entire_mapcount(page_folio(page));
 
@@ -3094,6 +3229,11 @@ static inline void pagetable_pud_dtor(struct ptdesc *ptdesc)
 
 extern void __init pagecache_init(void);
 extern void free_initmem(void);
+extern void free_vmemmap(void);
+extern int vmemmap_remap_free(unsigned long start, unsigned long end,
+			      unsigned long reuse,
+			      struct list_head *vmemmap_pages,
+			      unsigned long flags);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 532218167bba..00e4bb6c8533 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -916,6 +916,7 @@ struct zone {
 #ifdef CONFIG_CMA
 	unsigned long		cma_pages;
 #endif
+	atomic_long_t		hvo_freed;
 
 	const char		*name;
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b7944a833668..d058c4cb3c96 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -322,6 +322,8 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
 
 	switch (level) {
 	case RMAP_LEVEL_PTE:
+		if (hvo_dup_range(folio, page, nr_pages))
+			break;
 		do {
 			atomic_inc(&page->_mapcount);
 		} while (page++, --nr_pages > 0);
@@ -401,6 +403,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
 				if (PageAnonExclusive(page + i))
 					return -EBUSY;
 		}
+		if (hvo_dup_range(folio, page, nr_pages))
+			break;
 		do {
 			if (PageAnonExclusive(page))
 				ClearPageAnonExclusive(page);
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..74003495db32 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1448,6 +1448,7 @@ static int __ref kernel_init(void *unused)
 	kgdb_free_init_mem();
 	exit_boot_config();
 	free_initmem();
+	free_vmemmap();
 	mark_readonly();
 
 	/*
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..f3df0078505b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -57,7 +57,7 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 			continue;
 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
-		else
+		else if (!folio_is_hvo(folio) || !folio_nr_pages_mapped(folio))
 			/* Either a PTE-mapped or a PMD-mapped THP. */
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
 				       !PageAnonExclusive(page), page);
@@ -645,6 +645,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+		       !folio_is_hvo(page_folio(page)) &&
 		       !PageAnonExclusive(page), page);
 
 	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 62d2254bc51c..9e7e5d587a5c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,6 +2535,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 *
 		 * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
 		 */
+		if (folio_is_hvo(folio))
+			ClearPageAnonExclusive(page);
 		anon_exclusive = PageAnonExclusive(page);
 		if (freeze && anon_exclusive &&
 		    folio_try_share_anon_rmap_pmd(folio, page))
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index da177e49d956..9f43d900e83c 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -310,7 +310,7 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  *
  * Return: %0 on success, negative error code otherwise.
  */
-static int vmemmap_remap_free(unsigned long start, unsigned long end,
+int vmemmap_remap_free(unsigned long start, unsigned long end,
 			      unsigned long reuse,
 			      struct list_head *vmemmap_pages,
 			      unsigned long flags)
diff --git a/mm/internal.h b/mm/internal.h
index ac1d27468899..871c6eeb78b8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -52,15 +52,6 @@ struct folio_batch;
 
 void page_writeback_init(void);
 
-/*
- * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
- * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
- * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
- * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
- */
-#define ENTIRELY_MAPPED		0x800000
-#define FOLIO_PAGES_MAPPED	(ENTIRELY_MAPPED - 1)
-
 /*
  * Flags passed to __show_mem() and show_free_areas() to suppress output in
  * various contexts.
diff --git a/mm/memory.c b/mm/memory.c
index 0bfc8b007c01..db389f1d776d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3047,8 +3047,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
 	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
 
 	if (folio) {
-		VM_BUG_ON(folio_test_anon(folio) &&
-			  !PageAnonExclusive(vmf->page));
+		VM_BUG_ON_PAGE(folio_test_anon(folio) && !folio_is_hvo(folio) &&
+			       !PageAnonExclusive(vmf->page), vmf->page);
 		/*
 		 * Clear the folio's cpupid information as the existing
 		 * information potentially belongs to a now completely
@@ -3502,7 +3502,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	 */
 	if (folio && folio_test_anon(folio) &&
 	    (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
-		if (!PageAnonExclusive(vmf->page))
+		if (!folio_is_hvo(folio) && !PageAnonExclusive(vmf->page))
 			SetPageAnonExclusive(vmf->page);
 		if (unlikely(unshare)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4100,8 +4100,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 					rmap_flags);
 	}
 
-	VM_BUG_ON(!folio_test_anon(folio) ||
-			(pte_write(pte) && !PageAnonExclusive(page)));
+	VM_BUG_ON_PAGE(!folio_test_anon(folio) ||
+		       (pte_write(pte) && !folio_is_hvo(folio) && !PageAnonExclusive(page)),
+		       page);
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd843fb04f78..5f8c6583a191 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -53,6 +53,7 @@
 #include <linux/khugepaged.h>
 #include <linux/delayacct.h>
 #include <linux/cacheinfo.h>
+#include <linux/bootmem_info.h>
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
@@ -585,6 +586,10 @@ void prep_compound_page(struct page *page, unsigned int order)
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
+
+	if (page_is_hvo(page, order))
+		nr_pages = HVO_MOD;
+
 	for (i = 1; i < nr_pages; i++)
 		prep_compound_tail(page, i);
 
@@ -1124,10 +1129,15 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	 */
 	if (unlikely(order)) {
 		int i;
+		int nr_pages = 1 << order;
 
-		if (compound)
+		if (compound) {
+			if (page_is_hvo(page, order))
+				nr_pages = HVO_MOD;
 			page[1].flags &= ~PAGE_FLAGS_SECOND;
-		for (i = 1; i < (1 << order); i++) {
+		}
+
+		for (i = 1; i < nr_pages; i++) {
 			if (compound)
 				bad += free_tail_page_prepare(page, page + i);
 			if (is_check_pages_enabled()) {
@@ -1547,6 +1557,141 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	page_table_check_alloc(page, order);
 }
 
+static void prep_hvo_page(struct page *head, int order)
+{
+	LIST_HEAD(list);
+	struct page *page, *next;
+	int freed = 0;
+	unsigned long start = (unsigned long)head;
+	unsigned long end = start + hvo_order_size(order);
+
+	if (page_zonenum(head) != ZONE_NOMERGE)
+		return;
+
+	if (WARN_ON_ONCE(order != page_zone(head)->order)) {
+		bad_page(head, "invalid page order");
+		return;
+	}
+
+	if (!page_hvo_suitable(head, order) || page_is_hvo(head, order))
+		return;
+
+	vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0);
+
+	list_for_each_entry_safe(page, next, &list, lru) {
+		if (PageReserved(page))
+			free_bootmem_page(page);
+		else
+			__free_page(page);
+		freed++;
+	}
+
+	atomic_long_add(freed, &page_zone(head)->hvo_freed);
+}
+
+static void prep_nomerge_zone(struct zone *zone, enum migratetype type)
+{
+	int order;
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for (order = MAX_PAGE_ORDER; order > zone->order; order--) {
+		struct page *page;
+		int split = 0;
+		struct free_area *area = zone->free_area + order;
+
+		while ((page = get_page_from_free_area(area, type))) {
+			del_page_from_free_list(page, zone, order);
+			expand(zone, page, zone->order, order, type);
+			set_buddy_order(page, zone->order);
+			add_to_free_list(page, zone, zone->order, type);
+			split++;
+		}
+
+		pr_info("  HVO: order %d split %d\n", order, split);
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static void hvo_nomerge_zone(struct zone *zone, enum migratetype type)
+{
+	LIST_HEAD(old);
+	LIST_HEAD(new);
+	int nomem, freed;
+	unsigned long flags;
+	struct list_head list;
+	struct page *page, *next;
+	struct free_area *area = zone->free_area + zone->order;
+again:
+	nomem = freed = 0;
+	INIT_LIST_HEAD(&list);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list_splice_init(area->free_list + type, &old);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	list_for_each_entry_safe(page, next, &old, buddy_list) {
+		unsigned long start = (unsigned long)page;
+		unsigned long end = start + hvo_order_size(zone->order);
+
+		if (WARN_ON_ONCE(!IS_ALIGNED(start, PAGE_SIZE)))
+			continue;
+
+		if (vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0))
+			nomem++;
+	}
+
+	list_for_each_entry_safe(page, next, &list, lru) {
+		if (PageReserved(page))
+			free_bootmem_page(page);
+		else
+			__free_page(page);
+		freed++;
+	}
+
+	list_splice_init(&old, &new);
+	atomic_long_add(freed, &zone->hvo_freed);
+
+	pr_info("  HVO: nomem %d freed %d\n", nomem, freed);
+
+	if (!list_empty(area->free_list + type))
+		goto again;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list_splice(&new, area->free_list + type);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static bool zone_hvo_suitable(struct zone *zone)
+{
+	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
+		return false;
+
+	return zone_idx(zone) == ZONE_NOMERGE && hvo_order_size(zone->order) > PAGE_SIZE;
+}
+
+void free_vmemmap(void)
+{
+	struct zone *zone;
+
+	static_branch_inc(&hugetlb_optimize_vmemmap_key);
+
+	for_each_populated_zone(zone) {
+		if (!zone_hvo_suitable(zone))
+			continue;
+
+		pr_info("Freeing vmemmap of node %d zone %s\n",
+			 zone_to_nid(zone), zone->name);
+
+		prep_nomerge_zone(zone, MIGRATE_MOVABLE);
+		hvo_nomerge_zone(zone, MIGRATE_MOVABLE);
+
+		cond_resched();
+	}
+}
+
 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
@@ -1565,6 +1710,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 		set_page_pfmemalloc(page);
 	else
 		clear_page_pfmemalloc(page);
+
+	prep_hvo_page(page, order);
 }
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 0ddb28c52961..d339bf489230 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1143,6 +1143,10 @@ int folio_total_mapcount(struct folio *folio)
 	/* In the common case, avoid the loop when no pages mapped by PTE */
 	if (folio_nr_pages_mapped(folio) == 0)
 		return mapcount;
+
+	if (hvo_range_mapcount(folio, &folio->page, folio_nr_pages(folio), &mapcount))
+		return mapcount;
+
 	/*
 	 * Add all the PTE mappings of those pages mapped by PTE.
 	 * Limit the loop to folio_nr_pages_mapped()?
@@ -1168,6 +1172,8 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 
 	switch (level) {
 	case RMAP_LEVEL_PTE:
+		if (hvo_map_range(folio, page, nr_pages, &nr))
+			break;
 		do {
 			first = atomic_inc_and_test(&page->_mapcount);
 			if (first && folio_test_large(folio)) {
@@ -1314,6 +1320,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 	if (flags & RMAP_EXCLUSIVE) {
 		switch (level) {
 		case RMAP_LEVEL_PTE:
+			if (folio_is_hvo(folio))
+				break;
 			for (i = 0; i < nr_pages; i++)
 				SetPageAnonExclusive(page + i);
 			break;
@@ -1421,6 +1429,9 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	} else if (!folio_test_pmd_mappable(folio)) {
 		int i;
 
+		if (hvo_map_range(folio, &folio->page, nr, &nr))
+			goto done;
+
 		for (i = 0; i < nr; i++) {
 			struct page *page = folio_page(folio, i);
 
@@ -1437,7 +1448,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		SetPageAnonExclusive(&folio->page);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
 	}
-
+done:
 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
 }
 
@@ -1510,6 +1521,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 
 	switch (level) {
 	case RMAP_LEVEL_PTE:
+		if (hvo_unmap_range(folio, page, nr_pages, &nr))
+			break;
 		do {
 			last = atomic_add_negative(-1, &page->_mapcount);
 			if (last && folio_test_large(folio)) {
@@ -2212,7 +2225,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 			VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
-				       !anon_exclusive, subpage);
+				       !folio_is_hvo(folio) && !anon_exclusive, subpage);
 
 			/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
 			if (folio_test_hugetlb(folio)) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ff2114452334..f51f3b872270 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1704,6 +1704,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        present  %lu"
 		   "\n        managed  %lu"
 		   "\n        cma      %lu"
+		   "\n  hvo   freed    %lu"
 		   "\n        order    %u",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   zone->watermark_boost,
@@ -1714,6 +1715,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->present_pages,
 		   zone_managed_pages(zone),
 		   zone_cma_pages(zone),
+		   atomic_long_read(&zone->hvo_freed),
 		   zone->order);
 
 	seq_printf(m,
-- 
2.44.0.rc1.240.g4c46232300-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [Epilogue] Profile-Guided Heap Optimization and THP fungibility
  2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
                   ` (2 preceding siblings ...)
  2024-02-29 18:34 ` [Chapter Three] THP HVO: bring the hugeTLB feature to THP Yu Zhao
@ 2024-02-29 18:34 ` Yu Zhao
  2024-03-05  8:37 ` [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Barry Song
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Yu Zhao @ 2024-02-29 18:34 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Jonathan Corbet, Yu Zhao

In a nutshell, Profile-Guided Heap Optimization (PGHO) [1] allows
userspace memory allocators, e.g., TCMalloc [2], to
1. Group memory objects by hotness so that the accessed bit in the PMD
   entry mapping a THP can better reflect the overall hotness of that
   THP. A counterexample is a single hot page shielding the rest of
   the cold pages in that THP from being reclaimed.
2. Group objects by lifetime to reduce the chance of splits. Frequent
   splits increase the entropy of a system and can cause higher
   consumption of physical contiguity and reduced overall performance
   (due to TLB misses [2]).

No PGO (PGHO included) can account for every runtime behavior.
For example, an object predicted to be hot or long-lived can turn out
to be cold or short-lived. However, the kernel may not be able to
reclaim the THP containing that object because of the aforementioned
reasons. Instead, userspace memory allocators can choose to MADV_COLD
or MADV_FREE that object to avoid reclaiming other hot folios or
triggering OOM kills. This is part of the whole process, called THP
fungibility, and it ends up with the split of the THP containing that
object.
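
For concreteness, below is a minimal userspace sketch of the hint an
allocator might issue once a profiled-hot object turns out to be cold.
MADV_HUGEPAGE, MADV_COLD and MADV_FREE are existing madvise() flags;
MADV_RECOVER, discussed next, is only proposed here and therefore not
shown, and proper 2MB alignment of the mapping is glossed over:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)

int main(void)
{
        /* An anonymous span that the kernel may back with a 2MB THP. */
        char *span = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (span == MAP_FAILED)
                return 1;

        madvise(span, THP_SIZE, MADV_HUGEPAGE);
        memset(span, 0xaa, THP_SIZE);   /* fault the whole span in */

        /*
         * Suppose the profile mispredicted and a 64KB object inside the
         * span went cold: hint just that object, instead of letting it
         * keep the rest of the THP from being reclaimed.
         */
        if (madvise(span + 512 * 1024, 64 * 1024, MADV_COLD))
                perror("MADV_COLD");
        /* MADV_FREE works the same way when the object's contents are dead. */

        return 0;
}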

The full circle completes after userspace memory allocators recover
the THP from the split above. This part, called MADV_RECOVER, is done
by "collapsing" the pages of the original THP in place. Pages that
have been reused since the split are either reclaimed or migrated so
that they can become free again. Compared with MADV_COLLAPSE,
MADV_RECOVER has the following advantages:
1. It is more likely to succeed, since it does not try to allocate a
   new THP.
2. It does not copy the pages that are already in place and therefore
   has a smaller window during which the hot objects in those pages
   are inaccessible.

In essence, THP fungibility is a cooperation between the userspace
memory allocator and TAO to better utilize physical contiguity in a
system. It extends the heuristics for the bin packing problem from
allocation time (mobility and size, as described in Chapter One) to
runtime (hotness and lifetime). Machine learning is likely to become
the autotuner in the foreseeable future, just as it has for the
software-defined far memory at Google [3].

[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
[2] https://www.usenix.org/conference/osdi21/presentation/hunter
[3] https://research.google/pubs/software-defined-far-memory-in-warehouse-scale-computers/
-- 
2.44.0.rc1.240.g4c46232300-goog



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
@ 2024-02-29 20:28   ` Matthew Wilcox
  2024-03-06  3:51     ` Yu Zhao
  2024-02-29 23:31   ` Yang Shi
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2024-02-29 20:28 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
>    higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
>    unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
>    contiguous PTEs on arm64 [1], which are more suitable for client
>    workloads.

Can this mechanism be used to fully replace the hugetlb pool approach?
That would be a major selling point.  It kind of feels like it should,
but I am insufficiently expert to be certain.

I'll read over the patches sometime soon.  There's a lot to go through.
Something I didn't see in the cover letter or commit messages was any
discussion of page->flags and how many bits we use for ZONE (particularly
on 32-bit).  Perhaps I'll discover the answer to that as I read.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter Two] THP shattering: the reverse of collapsing
  2024-02-29 18:34 ` [Chapter Two] THP shattering: the reverse of collapsing Yu Zhao
@ 2024-02-29 21:55   ` Zi Yan
  2024-03-03  1:17     ` Yu Zhao
  0 siblings, 1 reply; 28+ messages in thread
From: Zi Yan @ 2024-02-29 21:55 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

[-- Attachment #1: Type: text/plain, Size: 1575 bytes --]

On 29 Feb 2024, at 13:34, Yu Zhao wrote:

> In contrast to split, shatter migrates occupied pages in a partially
> mapped THP to a bunch of base folios. IOW, unlike split, which is done
> in place, shatter is the exact opposite of collapse.
>
> The advantage of shattering is that it keeps the original THP intact.

Why keep the THP intact? To prevent the THP from being fragmented,
since the shattered part will not be returned to the buddy allocator
for reuse?

I agree with the idea of shattering, but keeping the THP intact might
give us trouble in the 1GB THP case when a PMD mapping is created
after shattering. How do we update the mapcount for a PMD mapping in
the middle of a 1GB folio? I used head[0], head[512], ... as the PMD
mapping head page, but that is ugly. For mTHPs, there is no such
problem since only PTE mappings are involved.

It might be better to just split the THP and move the free pages to a
do-not-use free list until the rest are freed too, if the zone
enforces a minimum order that is larger than that of the free pages.

> The cost of copying during the migration is not a side effect, but
> rather by design, since splitting is considered a discouraged
> behavior. In retail terms, the return of a purchase is charged with a
> restocking fee and the original goods can be resold.
>
> THPs from ZONE_NOMERGE can only be shattered, since they cannot be
> split or merged. THPs from ZONE_NOSPLIT can be shattered or split (the
> latter requires [1]), if they are above the minimum order.
>
> [1] https://lore.kernel.org/20240226205534.1603748-1-zi.yan@sent.com/
>


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter Three] THP HVO: bring the hugeTLB feature to THP
  2024-02-29 18:34 ` [Chapter Three] THP HVO: bring the hugeTLB feature to THP Yu Zhao
@ 2024-02-29 22:54   ` Yang Shi
  2024-03-01 15:42     ` David Hildenbrand
  2024-03-03  1:46     ` Yu Zhao
  0 siblings, 2 replies; 28+ messages in thread
From: Yang Shi @ 2024-02-29 22:54 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>
> HVO can be one of the perks for heavy THP users like it is for hugeTLB
> users. For example, if such a user uses 60% of physical memory for 2MB
> THPs, THP HVO can reduce the struct page overhead by half (60% * 7/8
> ~= 50%).
>
> ZONE_NOMERGE considerably simplifies the implementation of HVO for
> THPs, since THPs from it cannot be split or merged and thus do not
> require any correctness-related operations on tail pages beyond the
> second one.
>
> If a THP is mapped by PTEs, two optimization-related pieces of tail
> page state, i.e., _mapcount and PG_anon_exclusive, can be binned so
> that each tracks a group of pages, e.g., eight pages per group for
> 2MB THPs. This estimation, like the copying cost incurred during
> shattering, is by design, since mapping by PTEs is another
> discouraged behavior.

I'm confused by this. Can you please elaborate a little bit about
binning mapcount and PG_anon_exclusive?

For mapcount, IIUC, for example, when inc'ing a subpage's mapcount,
you actually inc the (i % 64) page's mapcount (assuming THP size is 2M
and base page size is 4K, so 8 strides and 64 pages in each stride),
right? But how can you tell whether each page of the 8 pages has
mapcount 1 or one page is mapped 8 times? Or does this actually not
matter, i.e., we don't even care to distinguish the two cases?

For PG_anon_exclusive, if one page has it set, does that mean the
other 7 pages in the other strides have it set too?

>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  include/linux/mm.h     | 140 ++++++++++++++++++++++++++++++++++++++
>  include/linux/mmzone.h |   1 +
>  include/linux/rmap.h   |   4 ++
>  init/main.c            |   1 +
>  mm/gup.c               |   3 +-
>  mm/huge_memory.c       |   2 +
>  mm/hugetlb_vmemmap.c   |   2 +-
>  mm/internal.h          |   9 ---
>  mm/memory.c            |  11 +--
>  mm/page_alloc.c        | 151 ++++++++++++++++++++++++++++++++++++++++-
>  mm/rmap.c              |  17 ++++-
>  mm/vmstat.c            |   2 +
>  12 files changed, 323 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f5a97dec5169..d7014fc35cca 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1196,6 +1196,138 @@ static inline void page_mapcount_reset(struct page *page)
>         atomic_set(&(page)->_mapcount, -1);
>  }
>
> +#define HVO_MOD (PAGE_SIZE / sizeof(struct page))
> +
> +static inline int hvo_order_size(int order)
> +{
> +       if (PAGE_SIZE % sizeof(struct page) || !is_power_of_2(HVO_MOD))
> +               return 0;
> +
> +       return (1 << order) * sizeof(struct page);
> +}
> +
> +static inline bool page_hvo_suitable(struct page *head, int order)
> +{
> +       VM_WARN_ON_ONCE_PAGE(!test_bit(PG_head, &head->flags), head);
> +
> +       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +               return false;
> +
> +       return page_zonenum(head) == ZONE_NOMERGE &&
> +              IS_ALIGNED((unsigned long)head, PAGE_SIZE) &&
> +              hvo_order_size(order) > PAGE_SIZE;
> +}
> +
> +static inline bool folio_hvo_suitable(struct folio *folio)
> +{
> +       return folio_test_large(folio) && page_hvo_suitable(&folio->page, folio_order(folio));
> +}
> +
> +static inline bool page_is_hvo(struct page *head, int order)
> +{
> +       return page_hvo_suitable(head, order) && test_bit(PG_head, &head[HVO_MOD].flags);
> +}
> +
> +static inline bool folio_is_hvo(struct folio *folio)
> +{
> +       return folio_test_large(folio) && page_is_hvo(&folio->page, folio_order(folio));
> +}
> +
> +/*
> + * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> + * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
> + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> + * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> + */
> +#define ENTIRELY_MAPPED                0x800000
> +#define FOLIO_PAGES_MAPPED     (ENTIRELY_MAPPED - 1)
> +
> +static inline int hvo_range_mapcount(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = folio_entire_mapcount(folio);
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               *ret += atomic_read(&page->_mapcount) + 1;
> +       }
> +
> +       return true;
> +}
> +
> +static inline bool hvo_map_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = 0;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               if (atomic_add_return(next - i, &page->_mapcount) == next - i - 1)
> +                       *ret += stride;
> +       }
> +
> +       if (atomic_add_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
> +               *ret = 0;
> +
> +       return true;
> +}
> +
> +static inline bool hvo_unmap_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = 0;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               if (atomic_sub_return(next - i, &page->_mapcount) == -1)
> +                       *ret += stride;
> +       }
> +
> +       if (atomic_sub_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
> +               *ret = 0;
> +
> +       return true;
> +}
> +
> +static inline bool hvo_dup_range(struct folio *folio, struct page *page, int nr_pages)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               atomic_add(next - i, &page->_mapcount);
> +       }
> +
> +       return true;
> +}
> +
>  /**
>   * page_mapcount() - Number of times this precise page is mapped.
>   * @page: The page.
> @@ -1212,6 +1344,9 @@ static inline int page_mapcount(struct page *page)
>  {
>         int mapcount = atomic_read(&page->_mapcount) + 1;
>
> +       if (hvo_range_mapcount(page_folio(page), page, 1, &mapcount))
> +               return mapcount;
> +
>         if (unlikely(PageCompound(page)))
>                 mapcount += folio_entire_mapcount(page_folio(page));
>
> @@ -3094,6 +3229,11 @@ static inline void pagetable_pud_dtor(struct ptdesc *ptdesc)
>
>  extern void __init pagecache_init(void);
>  extern void free_initmem(void);
> +extern void free_vmemmap(void);
> +extern int vmemmap_remap_free(unsigned long start, unsigned long end,
> +                             unsigned long reuse,
> +                             struct list_head *vmemmap_pages,
> +                             unsigned long flags);
>
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 532218167bba..00e4bb6c8533 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -916,6 +916,7 @@ struct zone {
>  #ifdef CONFIG_CMA
>         unsigned long           cma_pages;
>  #endif
> +       atomic_long_t           hvo_freed;
>
>         const char              *name;
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b7944a833668..d058c4cb3c96 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -322,6 +322,8 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_dup_range(folio, page, nr_pages))
> +                       break;
>                 do {
>                         atomic_inc(&page->_mapcount);
>                 } while (page++, --nr_pages > 0);
> @@ -401,6 +403,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
>                                 if (PageAnonExclusive(page + i))
>                                         return -EBUSY;
>                 }
> +               if (hvo_dup_range(folio, page, nr_pages))
> +                       break;
>                 do {
>                         if (PageAnonExclusive(page))
>                                 ClearPageAnonExclusive(page);
> diff --git a/init/main.c b/init/main.c
> index e24b0780fdff..74003495db32 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1448,6 +1448,7 @@ static int __ref kernel_init(void *unused)
>         kgdb_free_init_mem();
>         exit_boot_config();
>         free_initmem();
> +       free_vmemmap();
>         mark_readonly();
>
>         /*
> diff --git a/mm/gup.c b/mm/gup.c
> index df83182ec72d..f3df0078505b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -57,7 +57,7 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>                         continue;
>                 if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>                         VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> -               else
> +               else if (!folio_is_hvo(folio) || !folio_nr_pages_mapped(folio))
>                         /* Either a PTE-mapped or a PMD-mapped THP. */
>                         VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
>                                        !PageAnonExclusive(page), page);
> @@ -645,6 +645,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>         }
>
>         VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> +                      !folio_is_hvo(page_folio(page)) &&
>                        !PageAnonExclusive(page), page);
>
>         /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 62d2254bc51c..9e7e5d587a5c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,6 +2535,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  *
>                  * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
>                  */
> +               if (folio_is_hvo(folio))
> +                       ClearPageAnonExclusive(page);
>                 anon_exclusive = PageAnonExclusive(page);
>                 if (freeze && anon_exclusive &&
>                     folio_try_share_anon_rmap_pmd(folio, page))
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index da177e49d956..9f43d900e83c 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -310,7 +310,7 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
>   *
>   * Return: %0 on success, negative error code otherwise.
>   */
> -static int vmemmap_remap_free(unsigned long start, unsigned long end,
> +int vmemmap_remap_free(unsigned long start, unsigned long end,
>                               unsigned long reuse,
>                               struct list_head *vmemmap_pages,
>                               unsigned long flags)
> diff --git a/mm/internal.h b/mm/internal.h
> index ac1d27468899..871c6eeb78b8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -52,15 +52,6 @@ struct folio_batch;
>
>  void page_writeback_init(void);
>
> -/*
> - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> - * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
> - * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> - */
> -#define ENTIRELY_MAPPED                0x800000
> -#define FOLIO_PAGES_MAPPED     (ENTIRELY_MAPPED - 1)
> -
>  /*
>   * Flags passed to __show_mem() and show_free_areas() to suppress output in
>   * various contexts.
> diff --git a/mm/memory.c b/mm/memory.c
> index 0bfc8b007c01..db389f1d776d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3047,8 +3047,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
>         VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
>
>         if (folio) {
> -               VM_BUG_ON(folio_test_anon(folio) &&
> -                         !PageAnonExclusive(vmf->page));
> +               VM_BUG_ON_PAGE(folio_test_anon(folio) && !folio_is_hvo(folio) &&
> +                              !PageAnonExclusive(vmf->page), vmf->page);
>                 /*
>                  * Clear the folio's cpupid information as the existing
>                  * information potentially belongs to a now completely
> @@ -3502,7 +3502,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>          */
>         if (folio && folio_test_anon(folio) &&
>             (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
> -               if (!PageAnonExclusive(vmf->page))
> +               if (!folio_is_hvo(folio) && !PageAnonExclusive(vmf->page))
>                         SetPageAnonExclusive(vmf->page);
>                 if (unlikely(unshare)) {
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4100,8 +4100,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                         rmap_flags);
>         }
>
> -       VM_BUG_ON(!folio_test_anon(folio) ||
> -                       (pte_write(pte) && !PageAnonExclusive(page)));
> +       VM_BUG_ON_PAGE(!folio_test_anon(folio) ||
> +                      (pte_write(pte) && !folio_is_hvo(folio) && !PageAnonExclusive(page)),
> +                      page);
>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>         arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd843fb04f78..5f8c6583a191 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -53,6 +53,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/delayacct.h>
>  #include <linux/cacheinfo.h>
> +#include <linux/bootmem_info.h>
>  #include <asm/div64.h>
>  #include "internal.h"
>  #include "shuffle.h"
> @@ -585,6 +586,10 @@ void prep_compound_page(struct page *page, unsigned int order)
>         int nr_pages = 1 << order;
>
>         __SetPageHead(page);
> +
> +       if (page_is_hvo(page, order))
> +               nr_pages = HVO_MOD;
> +
>         for (i = 1; i < nr_pages; i++)
>                 prep_compound_tail(page, i);
>
> @@ -1124,10 +1129,15 @@ static __always_inline bool free_pages_prepare(struct page *page,
>          */
>         if (unlikely(order)) {
>                 int i;
> +               int nr_pages = 1 << order;
>
> -               if (compound)
> +               if (compound) {
> +                       if (page_is_hvo(page, order))
> +                               nr_pages = HVO_MOD;
>                         page[1].flags &= ~PAGE_FLAGS_SECOND;
> -               for (i = 1; i < (1 << order); i++) {
> +               }
> +
> +               for (i = 1; i < nr_pages; i++) {
>                         if (compound)
>                                 bad += free_tail_page_prepare(page, page + i);
>                         if (is_check_pages_enabled()) {
> @@ -1547,6 +1557,141 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>         page_table_check_alloc(page, order);
>  }
>
> +static void prep_hvo_page(struct page *head, int order)
> +{
> +       LIST_HEAD(list);
> +       struct page *page, *next;
> +       int freed = 0;
> +       unsigned long start = (unsigned long)head;
> +       unsigned long end = start + hvo_order_size(order);
> +
> +       if (page_zonenum(head) != ZONE_NOMERGE)
> +               return;
> +
> +       if (WARN_ON_ONCE(order != page_zone(head)->order)) {
> +               bad_page(head, "invalid page order");
> +               return;
> +       }
> +
> +       if (!page_hvo_suitable(head, order) || page_is_hvo(head, order))
> +               return;
> +
> +       vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0);
> +
> +       list_for_each_entry_safe(page, next, &list, lru) {
> +               if (PageReserved(page))
> +                       free_bootmem_page(page);
> +               else
> +                       __free_page(page);
> +               freed++;
> +       }
> +
> +       atomic_long_add(freed, &page_zone(head)->hvo_freed);
> +}
> +
> +static void prep_nomerge_zone(struct zone *zone, enum migratetype type)
> +{
> +       int order;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +
> +       for (order = MAX_PAGE_ORDER; order > zone->order; order--) {
> +               struct page *page;
> +               int split = 0;
> +               struct free_area *area = zone->free_area + order;
> +
> +               while ((page = get_page_from_free_area(area, type))) {
> +                       del_page_from_free_list(page, zone, order);
> +                       expand(zone, page, zone->order, order, type);
> +                       set_buddy_order(page, zone->order);
> +                       add_to_free_list(page, zone, zone->order, type);
> +                       split++;
> +               }
> +
> +               pr_info("  HVO: order %d split %d\n", order, split);
> +       }
> +
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static void hvo_nomerge_zone(struct zone *zone, enum migratetype type)
> +{
> +       LIST_HEAD(old);
> +       LIST_HEAD(new);
> +       int nomem, freed;
> +       unsigned long flags;
> +       struct list_head list;
> +       struct page *page, *next;
> +       struct free_area *area = zone->free_area + zone->order;
> +again:
> +       nomem = freed = 0;
> +       INIT_LIST_HEAD(&list);
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +       list_splice_init(area->free_list + type, &old);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +
> +       list_for_each_entry_safe(page, next, &old, buddy_list) {
> +               unsigned long start = (unsigned long)page;
> +               unsigned long end = start + hvo_order_size(zone->order);
> +
> +               if (WARN_ON_ONCE(!IS_ALIGNED(start, PAGE_SIZE)))
> +                       continue;
> +
> +               if (vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0))
> +                       nomem++;
> +       }
> +
> +       list_for_each_entry_safe(page, next, &list, lru) {
> +               if (PageReserved(page))
> +                       free_bootmem_page(page);
> +               else
> +                       __free_page(page);
> +               freed++;
> +       }
> +
> +       list_splice_init(&old, &new);
> +       atomic_long_add(freed, &zone->hvo_freed);
> +
> +       pr_info("  HVO: nomem %d freed %d\n", nomem, freed);
> +
> +       if (!list_empty(area->free_list + type))
> +               goto again;
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +       list_splice(&new, area->free_list + type);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static bool zone_hvo_suitable(struct zone *zone)
> +{
> +       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +               return false;
> +
> +       return zone_idx(zone) == ZONE_NOMERGE && hvo_order_size(zone->order) > PAGE_SIZE;
> +}
> +
> +void free_vmemmap(void)
> +{
> +       struct zone *zone;
> +
> +       static_branch_inc(&hugetlb_optimize_vmemmap_key);
> +
> +       for_each_populated_zone(zone) {
> +               if (!zone_hvo_suitable(zone))
> +                       continue;
> +
> +               pr_info("Freeing vmemmap of node %d zone %s\n",
> +                        zone_to_nid(zone), zone->name);
> +
> +               prep_nomerge_zone(zone, MIGRATE_MOVABLE);
> +               hvo_nomerge_zone(zone, MIGRATE_MOVABLE);
> +
> +               cond_resched();
> +       }
> +}
> +
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>                                                         unsigned int alloc_flags)
>  {
> @@ -1565,6 +1710,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
>                 set_page_pfmemalloc(page);
>         else
>                 clear_page_pfmemalloc(page);
> +
> +       prep_hvo_page(page, order);
>  }
>
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0ddb28c52961..d339bf489230 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1143,6 +1143,10 @@ int folio_total_mapcount(struct folio *folio)
>         /* In the common case, avoid the loop when no pages mapped by PTE */
>         if (folio_nr_pages_mapped(folio) == 0)
>                 return mapcount;
> +
> +       if (hvo_range_mapcount(folio, &folio->page, folio_nr_pages(folio), &mapcount))
> +               return mapcount;
> +
>         /*
>          * Add all the PTE mappings of those pages mapped by PTE.
>          * Limit the loop to folio_nr_pages_mapped()?
> @@ -1168,6 +1172,8 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_map_range(folio, page, nr_pages, &nr))
> +                       break;
>                 do {
>                         first = atomic_inc_and_test(&page->_mapcount);
>                         if (first && folio_test_large(folio)) {
> @@ -1314,6 +1320,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>         if (flags & RMAP_EXCLUSIVE) {
>                 switch (level) {
>                 case RMAP_LEVEL_PTE:
> +                       if (folio_is_hvo(folio))
> +                               break;
>                         for (i = 0; i < nr_pages; i++)
>                                 SetPageAnonExclusive(page + i);
>                         break;
> @@ -1421,6 +1429,9 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>         } else if (!folio_test_pmd_mappable(folio)) {
>                 int i;
>
> +               if (hvo_map_range(folio, &folio->page, nr, &nr))
> +                       goto done;
> +
>                 for (i = 0; i < nr; i++) {
>                         struct page *page = folio_page(folio, i);
>
> @@ -1437,7 +1448,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>                 SetPageAnonExclusive(&folio->page);
>                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>         }
> -
> +done:
>         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>  }
>
> @@ -1510,6 +1521,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_unmap_range(folio, page, nr_pages, &nr))
> +                       break;
>                 do {
>                         last = atomic_add_negative(-1, &page->_mapcount);
>                         if (last && folio_test_large(folio)) {
> @@ -2212,7 +2225,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>                                 break;
>                         }
>                         VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
> -                                      !anon_exclusive, subpage);
> +                                      !folio_is_hvo(folio) && !anon_exclusive, subpage);
>
>                         /* See folio_try_share_anon_rmap_pte(): clear PTE first. */
>                         if (folio_test_hugetlb(folio)) {
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ff2114452334..f51f3b872270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1704,6 +1704,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        present  %lu"
>                    "\n        managed  %lu"
>                    "\n        cma      %lu"
> +                  "\n  hvo   freed    %lu"
>                    "\n        order    %u",
>                    zone_page_state(zone, NR_FREE_PAGES),
>                    zone->watermark_boost,
> @@ -1714,6 +1715,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    zone->present_pages,
>                    zone_managed_pages(zone),
>                    zone_cma_pages(zone),
> +                  atomic_long_read(&zone->hvo_freed),
>                    zone->order);
>
>         seq_printf(m,
> --
> 2.44.0.rc1.240.g4c46232300-goog
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
  2024-02-29 20:28   ` Matthew Wilcox
@ 2024-02-29 23:31   ` Yang Shi
  2024-03-03  2:47     ` Yu Zhao
  2024-03-04 15:19   ` Matthew Wilcox
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2024-02-29 23:31 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
>    memory.
> 2. The device zone provides interoperability between CPU and device
>    memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
>    ZONE_MOVABLE) and restricted to a minimum order to be
>    anti-fragmentation. The latter means that they cannot be split down
>    below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
>    to an exact order. The latter means that not only is split
>    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
>    reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
>    higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
>    unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
>    contiguous PTEs on arm64 [1], which are more suitable for client
>    workloads.

I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios, if the order > the min order of the
policy zones, the fallback order should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right?

If all other zones are depleted, an allocation whose order is < the
min order won't fall back to the policy zones and will fail, just like
a non-movable allocation can't fall back to ZONE_MOVABLE even though
there is enough free memory in that zone, right?
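
To make my assumption concrete, here is a small userspace sketch of
the fallback I have in mind, loosely modeled on gfp_order_zone() in
this patch; the zone values and the orders below are made-up stand-ins
for illustration, not the kernel definitions:

#include <stdio.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_NOSPLIT, ZONE_NOMERGE, NR_ZONES };

static const char *zone_names[NR_ZONES] = { "Normal", "Movable", "NoSplit", "NoMerge" };

static int zone_nosplit_order = 4;      /* e.g. nosplit=...,4 (64KB with 4KB pages) */
static int zone_nomerge_order = 9;      /* e.g. nomerge=...,9 (2MB with 4KB pages) */

/* Highest zone a movable, compound allocation of @order may start from. */
static int highest_zone(int order)
{
        int zid = ZONE_NOMERGE;                 /* LAST_VIRT_ZONE for THP allocations */

        if (order != zone_nomerge_order)
                zid = ZONE_NOMERGE - 1;         /* not the exact order: skip nomerge */
        if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
                zid = ZONE_NOSPLIT - 1;         /* below the min order: skip nosplit */

        return zid;
}

int main(void)
{
        int orders[] = { 0, 4, 9 };
        int i, zid;

        for (i = 0; i < 3; i++) {
                printf("order %d: ", orders[i]);
                for (zid = highest_zone(orders[i]); zid >= ZONE_NORMAL; zid--)
                        printf("%s%s", zone_names[zid], zid ? " -> " : "\n");
        }

        return 0;
}

This prints "Movable -> Normal" for order 0, "NoSplit -> Movable ->
Normal" for order 4 and "NoMerge -> NoSplit -> Movable -> Normal" for
order 9, which is the fallback I described above. If the actual walk
differs, e.g., in how zone_is_suitable() filters zones during the
zonelist iteration, please correct me.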

>
> Policy zones can be dynamically resized by offlining pages in one of
> them and onlining those pages in another of them. Note that this is
> only done among policy zones, not between a policy zone and a physical
> zone, since resizing is a (software) policy, not a physical
> characteristic.
>
> Implementing the same idea in the pageblock granularity has also been
> explored but rejected at Google. Pageblocks have a finer granularity
> and therefore can be more flexible than zones. The tradeoff is that
> this alternative implementation was more complex and failed to bring a
> better ROI. However, the rejection was mainly due to its inability to
> be smoothly extended to 1GB THPs [2], which is a planned use case of
> TAO.
>
> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  10 +
>  drivers/virtio/virtio_mem.c                   |   2 +-
>  include/linux/gfp.h                           |  24 +-
>  include/linux/huge_mm.h                       |   6 -
>  include/linux/mempolicy.h                     |   2 +-
>  include/linux/mmzone.h                        |  52 +-
>  include/linux/nodemask.h                      |   2 +-
>  include/linux/vm_event_item.h                 |   2 +-
>  include/trace/events/mmflags.h                |   4 +-
>  mm/compaction.c                               |  12 +
>  mm/huge_memory.c                              |   5 +-
>  mm/mempolicy.c                                |  14 +-
>  mm/migrate.c                                  |   7 +-
>  mm/mm_init.c                                  | 452 ++++++++++--------
>  mm/page_alloc.c                               |  44 +-
>  mm/page_isolation.c                           |   2 +-
>  mm/swap_slots.c                               |   3 +-
>  mm/vmscan.c                                   |  32 +-
>  mm/vmstat.c                                   |   7 +-
>  19 files changed, 431 insertions(+), 251 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 31b3a25680d0..a6c181f6efde 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3529,6 +3529,16 @@
>                         allocations which rules out almost all kernel
>                         allocations. Use with caution!
>
> +       nosplit=X,Y     [MM] Set the minimum order of the nosplit zone. Pages in
> +                       this zone can't be split down below order Y, while free
> +                       or in use.
> +                       Like movablecore, X should be either nn[KMGTPE] or n%.
> +
> +       nomerge=X,Y     [MM] Set the exact orders of the nomerge zone. Pages in
> +                       this zone are always order Y, meaning they can't be
> +                       split or merged while free or in use.
> +                       Like movablecore, X should be either nn[KMGTPE] or n%.
> +
>         MTD_Partition=  [MTD]
>                         Format: <name>,<region-number>,<size>,<offset>
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 8e3223294442..37ecf5ee4afd 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm,
>                 page = pfn_to_online_page(pfn);
>                 if (!page)
>                         continue;
> -               if (page_zonenum(page) != ZONE_MOVABLE)
> +               if (!is_zone_movable_page(page))
>                         return false;
>         }
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index de292a007138..c0f9d21b4d18 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>   * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
>   */
>
> -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
> -/* ZONE_DEVICE is not a valid GFP zone specifier */
> +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4
> +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */
>  #define GFP_ZONES_SHIFT 2
>  #else
>  #define GFP_ZONES_SHIFT ZONES_SHIFT
> @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags)
>         z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
>                                          ((1 << GFP_ZONES_SHIFT) - 1);
>         VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
> +
> +       if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP))
> +               return LAST_VIRT_ZONE;
> +
>         return z;
>  }
>
> +extern int zone_nomerge_order __read_mostly;
> +extern int zone_nosplit_order __read_mostly;
> +
> +static inline enum zone_type gfp_order_zone(gfp_t flags, int order)
> +{
> +       enum zone_type zid = gfp_zone(flags);
> +
> +       if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
> +               zid = ZONE_NOMERGE - 1;
> +
> +       if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
> +               zid = ZONE_NOSPLIT - 1;
> +
> +       return zid;
> +}
> +
>  /*
>   * There is only one page-allocator function, and two main namespaces to
>   * it. The alloc_page*() variants return 'struct page *' and as such
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..9960ad7c3b10 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>                 unsigned long len, unsigned long pgoff, unsigned long flags);
>
>  void folio_prep_large_rmappable(struct folio *folio);
> -bool can_split_folio(struct folio *folio, int *pextra_pins);
>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {}
>
>  #define thp_get_unmapped_area  NULL
>
> -static inline bool
> -can_split_folio(struct folio *folio, int *pextra_pins)
> -{
> -       return false;
> -}
>  static inline int
>  split_huge_page_to_list(struct page *page, struct list_head *list)
>  {
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..a92bcf47cf8c 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -150,7 +150,7 @@ extern enum zone_type policy_zone;
>
>  static inline void check_highest_zone(enum zone_type k)
>  {
> -       if (k > policy_zone && k != ZONE_MOVABLE)
> +       if (k > policy_zone && !zid_is_virt(k))
>                 policy_zone = k;
>  }
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a497f189d988..532218167bba 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -805,11 +805,15 @@ enum zone_type {
>          * there can be false negatives).
>          */
>         ZONE_MOVABLE,
> +       ZONE_NOSPLIT,
> +       ZONE_NOMERGE,
>  #ifdef CONFIG_ZONE_DEVICE
>         ZONE_DEVICE,
>  #endif
> -       __MAX_NR_ZONES
> +       __MAX_NR_ZONES,
>
> +       LAST_PHYS_ZONE = ZONE_MOVABLE - 1,
> +       LAST_VIRT_ZONE = ZONE_NOMERGE,
>  };
>
>  #ifndef __GENERATING_BOUNDS_H
> @@ -929,6 +933,8 @@ struct zone {
>         seqlock_t               span_seqlock;
>  #endif
>
> +       int order;
> +
>         int initialized;
>
>         /* Write-intensive fields used from the page allocator */
> @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio)
>
>  static inline bool is_zone_movable_page(const struct page *page)
>  {
> -       return page_zonenum(page) == ZONE_MOVABLE;
> +       return page_zonenum(page) >= ZONE_MOVABLE;
>  }
>
>  static inline bool folio_is_zone_movable(const struct folio *folio)
>  {
> -       return folio_zonenum(folio) == ZONE_MOVABLE;
> +       return folio_zonenum(folio) >= ZONE_MOVABLE;
> +}
> +
> +static inline bool page_can_split(struct page *page)
> +{
> +       return page_zonenum(page) < ZONE_NOSPLIT;
> +}
> +
> +static inline bool folio_can_split(struct folio *folio)
> +{
> +       return folio_zonenum(folio) < ZONE_NOSPLIT;
>  }
>  #endif
>
> @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; };
>   */
>  #define zone_idx(zone)         ((zone) - (zone)->zone_pgdat->node_zones)
>
> +static inline bool zid_is_virt(enum zone_type zid)
> +{
> +       return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE;
> +}
> +
> +static inline bool zone_can_frag(struct zone *zone)
> +{
> +       VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT);
> +
> +       return zone_idx(zone) < ZONE_NOSPLIT;
> +}
> +
> +static inline bool zone_is_suitable(struct zone *zone, int order)
> +{
> +       int zid = zone_idx(zone);
> +
> +       if (zid < ZONE_NOSPLIT)
> +               return true;
> +
> +       if (!zone->order)
> +               return false;
> +
> +       return (zid == ZONE_NOSPLIT && order >= zone->order) ||
> +              (zid == ZONE_NOMERGE && order == zone->order);
> +}
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  static inline bool zone_is_zone_device(struct zone *zone)
>  {
> @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone)
>  static inline void zone_set_nid(struct zone *zone, int nid) {}
>  #endif
>
> -extern int movable_zone;
> +extern int virt_zone;
>
>  static inline int is_highmem_idx(enum zone_type idx)
>  {
>  #ifdef CONFIG_HIGHMEM
>         return (idx == ZONE_HIGHMEM ||
> -               (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
> +               (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM));
>  #else
>         return 0;
>  #endif
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index b61438313a73..34fbe910576d 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -404,7 +404,7 @@ enum node_states {
>  #else
>         N_HIGH_MEMORY = N_NORMAL_MEMORY,
>  #endif
> -       N_MEMORY,               /* The node has memory(regular, high, movable) */
> +       N_MEMORY,               /* The node has memory in any of the zones */
>         N_CPU,          /* The node has one or more cpus */
>         N_GENERIC_INITIATOR,    /* The node has one or more Generic Initiators */
>         NR_NODE_STATES
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 747943bc8cc2..9a54d15d5ec3 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -27,7 +27,7 @@
>  #endif
>
>  #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
> -       HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)
> +       HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx)
>
>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 FOR_ALL_ZONES(PGALLOC)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index d801409b33cf..2b5fdafaadea 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
>         IFDEF_ZONE_DMA32(       EM (ZONE_DMA32,  "DMA32"))      \
>                                 EM (ZONE_NORMAL, "Normal")      \
>         IFDEF_ZONE_HIGHMEM(     EM (ZONE_HIGHMEM,"HighMem"))    \
> -                               EMe(ZONE_MOVABLE,"Movable")
> +                               EM (ZONE_MOVABLE,"Movable")     \
> +                               EM (ZONE_NOSPLIT,"NoSplit")     \
> +                               EMe(ZONE_NOMERGE,"NoMerge")
>
>  #define LRU_NAMES              \
>                 EM (LRU_INACTIVE_ANON, "inactive_anon") \
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4add68d40e8d..8a64c805f411 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>                                         ac->highest_zoneidx, ac->nodemask) {
>                 enum compact_result status;
>
> +               if (!zone_can_frag(zone))
> +                       continue;
> +
>                 if (prio > MIN_COMPACT_PRIORITY
>                                         && compaction_deferred(zone, order)) {
>                         rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat)
>                 if (!populated_zone(zone))
>                         continue;
>
> +               if (!zone_can_frag(zone))
> +                       continue;
> +
>                 cc.zone = zone;
>
>                 compact_zone(&cc, NULL);
> @@ -2846,6 +2852,9 @@ static void compact_node(int nid)
>                 if (!populated_zone(zone))
>                         continue;
>
> +               if (!zone_can_frag(zone))
> +                       continue;
> +
>                 cc.zone = zone;
>
>                 compact_zone(&cc, NULL);
> @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>                 if (!populated_zone(zone))
>                         continue;
>
> +               if (!zone_can_frag(zone))
> +                       continue;
> +
>                 ret = compaction_suit_allocation_order(zone,
>                                 pgdat->kcompactd_max_order,
>                                 highest_zoneidx, ALLOC_WMARK_MIN);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c958f7ebb5..b57faa0a1e83 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  }
>
>  /* Racy check whether the huge page can be split */
> -bool can_split_folio(struct folio *folio, int *pextra_pins)
> +static bool can_split_folio(struct folio *folio, int *pextra_pins)
>  {
>         int extra_pins;
>
> +       if (!folio_can_split(folio))
> +               return false;
> +
>         /* Additional pins from page cache */
>         if (folio_test_anon(folio))
>                 extra_pins = folio_test_swapcache(folio) ?
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..1f84dd759086 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma)
>
>  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  {
> -       enum zone_type dynamic_policy_zone = policy_zone;
> -
> -       BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
> +       WARN_ON_ONCE(zid_is_virt(policy_zone));
>
>         /*
> -        * if policy->nodes has movable memory only,
> -        * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
> +        * If policy->nodes has memory in virtual zones only, we apply policy
> +        * only if gfp_zone(gfp) can allocate from those zones.
>          *
>          * policy->nodes is intersect with node_states[N_MEMORY].
>          * so if the following test fails, it implies
> -        * policy->nodes has movable memory only.
> +        * policy->nodes has memory in virtual zones only.
>          */
>         if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
> -               dynamic_policy_zone = ZONE_MOVABLE;
> +               return zone > LAST_PHYS_ZONE;
>
> -       return zone >= dynamic_policy_zone;
> +       return zone >= policy_zone;
>  }
>
>  /* Do dynamic interleaving for a process */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cc9f2bcd73b4..f615c0c22046 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>  {
>         int rc;
>
> +       if (!folio_can_split(folio))
> +               return -EBUSY;
> +
>         folio_lock(folio);
>         rc = split_folio_to_list(folio, split_folios);
>         folio_unlock(folio);
> @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
>                 order = folio_order(src);
>         }
>         zidx = zone_idx(folio_zone(src));
> -       if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
> +       if (zidx > ZONE_NORMAL)
>                 gfp_mask |= __GFP_HIGHMEM;
>
>         return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
> @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
>                                 break;
>                 }
>                 wakeup_kswapd(pgdat->node_zones + z, 0,
> -                             folio_order(folio), ZONE_MOVABLE);
> +                             folio_order(folio), z);
>                 return 0;
>         }
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2c19f5515e36..7769c21e6d54 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init);
>
>  static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
>  static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
> -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
>
> -static unsigned long required_kernelcore __initdata;
> -static unsigned long required_kernelcore_percent __initdata;
> -static unsigned long required_movablecore __initdata;
> -static unsigned long required_movablecore_percent __initdata;
> +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata;
> +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid])
> +
> +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
> +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE])
> +
> +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
> +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE])
> +
> +int zone_nosplit_order __read_mostly;
> +int zone_nomerge_order __read_mostly;
>
>  static unsigned long nr_kernel_pages __initdata;
>  static unsigned long nr_all_pages __initdata;
> @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p)
>                 return 0;
>         }
>
> -       return cmdline_parse_core(p, &required_kernelcore,
> -                                 &required_kernelcore_percent);
> +       return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE),
> +                                 &percentage_of(LAST_PHYS_ZONE));
>  }
>  early_param("kernelcore", cmdline_parse_kernelcore);
>
> @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore);
>   */
>  static int __init cmdline_parse_movablecore(char *p)
>  {
> -       return cmdline_parse_core(p, &required_movablecore,
> -                                 &required_movablecore_percent);
> +       return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE),
> +                                 &percentage_of(ZONE_MOVABLE));
>  }
>  early_param("movablecore", cmdline_parse_movablecore);
>
> +static int __init parse_zone_order(char *p, unsigned long *nr_pages,
> +                                  unsigned long *percent, int *order)
> +{
> +       int err;
> +       unsigned long n;
> +       char *s = strchr(p, ',');
> +
> +       if (!s)
> +               return -EINVAL;
> +
> +       *s++ = '\0';
> +
> +       err = kstrtoul(s, 0, &n);
> +       if (err)
> +               return err;
> +
> +       if (n < 2 || n > MAX_PAGE_ORDER)
> +               return -EINVAL;
> +
> +       err = cmdline_parse_core(p, nr_pages, percent);
> +       if (err)
> +               return err;
> +
> +       *order = n;
> +
> +       return 0;
> +}
> +
> +static int __init parse_zone_nosplit(char *p)
> +{
> +       return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT),
> +                               &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order);
> +}
> +early_param("nosplit", parse_zone_nosplit);
> +
> +static int __init parse_zone_nomerge(char *p)
> +{
> +       return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE),
> +                               &percentage_of(ZONE_NOMERGE), &zone_nomerge_order);
> +}
> +early_param("nomerge", parse_zone_nomerge);
> +
>  /*
>   * early_calculate_totalpages()
> - * Sum pages in active regions for movable zone.
> + * Sum pages in active regions for virtual zones.
>   * Populate N_MEMORY for calculating usable_nodes.
>   */
>  static unsigned long __init early_calculate_totalpages(void)
> @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void)
>  }
>
>  /*
> - * This finds a zone that can be used for ZONE_MOVABLE pages. The
> + * This finds a physical zone that can be used for virtual zones. The
>   * assumption is made that zones within a node are ordered in monotonic
>   * increasing memory addresses so that the "highest" populated zone is used
>   */
> -static void __init find_usable_zone_for_movable(void)
> +static void __init find_usable_zone(void)
>  {
>         int zone_index;
> -       for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
> -               if (zone_index == ZONE_MOVABLE)
> -                       continue;
> -
> +       for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) {
>                 if (arch_zone_highest_possible_pfn[zone_index] >
>                                 arch_zone_lowest_possible_pfn[zone_index])
>                         break;
>         }
>
>         VM_BUG_ON(zone_index == -1);
> -       movable_zone = zone_index;
> +       virt_zone = zone_index;
> +}
> +
> +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn)
> +{
> +       int i, nid;
> +       unsigned long node_avg, remaining;
> +       int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> +       /* usable_startpfn is the lowest possible pfn virtual zones can be at */
> +       unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone];
> +
> +restart:
> +       /* Carve out memory as evenly as possible throughout nodes */
> +       node_avg = occupied / usable_nodes;
> +       for_each_node_state(nid, N_MEMORY) {
> +               unsigned long start_pfn, end_pfn;
> +
> +               /*
> +                * Recalculate node_avg if the division per node now exceeds
> +                * what is necessary to satisfy the amount of memory to carve
> +                * out.
> +                */
> +               if (occupied < node_avg)
> +                       node_avg = occupied / usable_nodes;
> +
> +               /*
> +                * As the map is walked, we track how much memory is usable
> +                * using remaining. When it is 0, the rest of the node is
> +                * usable.
> +                */
> +               remaining = node_avg;
> +
> +               /* Go through each range of PFNs within this node */
> +               for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> +                       unsigned long size_pages;
> +
> +                       start_pfn = max(start_pfn, zone_pfn[nid]);
> +                       if (start_pfn >= end_pfn)
> +                               continue;
> +
> +                       /* Account for what is only usable when carving out */
> +                       if (start_pfn < usable_startpfn) {
> +                               unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn;
> +
> +                               remaining -= min(nr_pages, remaining);
> +                               occupied -= min(nr_pages, occupied);
> +
> +                               /* Continue if range is now fully accounted */
> +                               if (end_pfn <= usable_startpfn) {
> +
> +                                       /*
> +                                        * Push zone_pfn to the end so that if
> +                                        * we have to carve out more across
> +                                        * nodes, we will not double account
> +                                        * here.
> +                                        */
> +                                       zone_pfn[nid] = end_pfn;
> +                                       continue;
> +                               }
> +                               start_pfn = usable_startpfn;
> +                       }
> +
> +                       /*
> +                        * The usable PFN range is from start_pfn->end_pfn.
> +                        * Calculate size_pages as the number of pages used.
> +                        */
> +                       size_pages = end_pfn - start_pfn;
> +                       if (size_pages > remaining)
> +                               size_pages = remaining;
> +                       zone_pfn[nid] = start_pfn + size_pages;
> +
> +                       /*
> +                        * Some memory was carved out, update counts and break
> +                        * if the request for this node has been satisfied.
> +                        */
> +                       occupied -= min(occupied, size_pages);
> +                       remaining -= size_pages;
> +                       if (!remaining)
> +                               break;
> +               }
> +       }
> +
> +       /*
> +        * If there is still more to carve out, we do another pass with one less
> +        * node in the count. This will push zone_pfn[nid] further along on the
> +        * nodes that still have memory until the request is fully satisfied.
> +        */
> +       usable_nodes--;
> +       if (usable_nodes && occupied > usable_nodes)
> +               goto restart;
>  }
>
>  /*
> @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void)
>   * memory. When they don't, some nodes will have more kernelcore than
>   * others
>   */
> -static void __init find_zone_movable_pfns_for_nodes(void)
> +static void __init find_virt_zones(void)
>  {
> -       int i, nid;
> +       int i;
> +       int nid;
>         unsigned long usable_startpfn;
> -       unsigned long kernelcore_node, kernelcore_remaining;
>         /* save the state before borrow the nodemask */
>         nodemask_t saved_node_state = node_states[N_MEMORY];
>         unsigned long totalpages = early_calculate_totalpages();
> -       int usable_nodes = nodes_weight(node_states[N_MEMORY]);
>         struct memblock_region *r;
> +       unsigned long occupied = 0;
>
> -       /* Need to find movable_zone earlier when movable_node is specified. */
> -       find_usable_zone_for_movable();
> +       /* Need to find virt_zone earlier when movable_node is specified. */
> +       find_usable_zone();
>
>         /*
>          * If movable_node is specified, ignore kernelcore and movablecore
> @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>                         nid = memblock_get_region_node(r);
>
>                         usable_startpfn = PFN_DOWN(r->base);
> -                       zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> -                               min(usable_startpfn, zone_movable_pfn[nid]) :
> +                       pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
> +                               min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
>                                 usable_startpfn;
>                 }
>
> @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>                                 continue;
>                         }
>
> -                       zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> -                               min(usable_startpfn, zone_movable_pfn[nid]) :
> +                       pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
> +                               min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
>                                 usable_startpfn;
>                 }
>
> @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>                 goto out2;
>         }
>
> +       if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) {
> +               nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0;
> +               percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0;
> +               zone_nosplit_order = zone_nomerge_order = 0;
> +               pr_warn("zone %s order %d must be higher zone %s order %d\n",
> +                       zone_names[ZONE_NOMERGE], zone_nomerge_order,
> +                       zone_names[ZONE_NOSPLIT], zone_nosplit_order);
> +       }
> +
>         /*
>          * If kernelcore=nn% or movablecore=nn% was specified, calculate the
>          * amount of necessary memory.
>          */
> -       if (required_kernelcore_percent)
> -               required_kernelcore = (totalpages * 100 * required_kernelcore_percent) /
> -                                      10000UL;
> -       if (required_movablecore_percent)
> -               required_movablecore = (totalpages * 100 * required_movablecore_percent) /
> -                                       10000UL;
> +       for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) {
> +               if (percentage_of(i))
> +                       nr_pages_of(i) = totalpages * percentage_of(i) / 100;
> +
> +               nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES);
> +               occupied += nr_pages_of(i);
> +       }
>
>         /*
>          * If movablecore= was specified, calculate what size of
>          * kernelcore that corresponds so that memory usable for
>          * any allocation type is evenly spread. If both kernelcore
>          * and movablecore are specified, then the value of kernelcore
> -        * will be used for required_kernelcore if it's greater than
> -        * what movablecore would have allowed.
> +        * will be used if it's greater than what movablecore would have
> +        * allowed.
>          */
> -       if (required_movablecore) {
> -               unsigned long corepages;
> +       if (occupied < totalpages) {
> +               enum zone_type zid;
>
> -               /*
> -                * Round-up so that ZONE_MOVABLE is at least as large as what
> -                * was requested by the user
> -                */
> -               required_movablecore =
> -                       roundup(required_movablecore, MAX_ORDER_NR_PAGES);
> -               required_movablecore = min(totalpages, required_movablecore);
> -               corepages = totalpages - required_movablecore;
> -
> -               required_kernelcore = max(required_kernelcore, corepages);
> +               zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ?
> +                     LAST_PHYS_ZONE : ZONE_MOVABLE;
> +               nr_pages_of(zid) += totalpages - occupied;
>         }
>
>         /*
>          * If kernelcore was not specified or kernelcore size is larger
> -        * than totalpages, there is no ZONE_MOVABLE.
> +        * than totalpages, there are not virtual zones.
>          */
> -       if (!required_kernelcore || required_kernelcore >= totalpages)
> +       occupied = nr_pages_of(LAST_PHYS_ZONE);
> +       if (!occupied || occupied >= totalpages)
>                 goto out;
>
> -       /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
> -       usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
> +       for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
> +               if (!nr_pages_of(i))
> +                       continue;
>
> -restart:
> -       /* Spread kernelcore memory as evenly as possible throughout nodes */
> -       kernelcore_node = required_kernelcore / usable_nodes;
> -       for_each_node_state(nid, N_MEMORY) {
> -               unsigned long start_pfn, end_pfn;
> -
> -               /*
> -                * Recalculate kernelcore_node if the division per node
> -                * now exceeds what is necessary to satisfy the requested
> -                * amount of memory for the kernel
> -                */
> -               if (required_kernelcore < kernelcore_node)
> -                       kernelcore_node = required_kernelcore / usable_nodes;
> -
> -               /*
> -                * As the map is walked, we track how much memory is usable
> -                * by the kernel using kernelcore_remaining. When it is
> -                * 0, the rest of the node is usable by ZONE_MOVABLE
> -                */
> -               kernelcore_remaining = kernelcore_node;
> -
> -               /* Go through each range of PFNs within this node */
> -               for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> -                       unsigned long size_pages;
> -
> -                       start_pfn = max(start_pfn, zone_movable_pfn[nid]);
> -                       if (start_pfn >= end_pfn)
> -                               continue;
> -
> -                       /* Account for what is only usable for kernelcore */
> -                       if (start_pfn < usable_startpfn) {
> -                               unsigned long kernel_pages;
> -                               kernel_pages = min(end_pfn, usable_startpfn)
> -                                                               - start_pfn;
> -
> -                               kernelcore_remaining -= min(kernel_pages,
> -                                                       kernelcore_remaining);
> -                               required_kernelcore -= min(kernel_pages,
> -                                                       required_kernelcore);
> -
> -                               /* Continue if range is now fully accounted */
> -                               if (end_pfn <= usable_startpfn) {
> -
> -                                       /*
> -                                        * Push zone_movable_pfn to the end so
> -                                        * that if we have to rebalance
> -                                        * kernelcore across nodes, we will
> -                                        * not double account here
> -                                        */
> -                                       zone_movable_pfn[nid] = end_pfn;
> -                                       continue;
> -                               }
> -                               start_pfn = usable_startpfn;
> -                       }
> -
> -                       /*
> -                        * The usable PFN range for ZONE_MOVABLE is from
> -                        * start_pfn->end_pfn. Calculate size_pages as the
> -                        * number of pages used as kernelcore
> -                        */
> -                       size_pages = end_pfn - start_pfn;
> -                       if (size_pages > kernelcore_remaining)
> -                               size_pages = kernelcore_remaining;
> -                       zone_movable_pfn[nid] = start_pfn + size_pages;
> -
> -                       /*
> -                        * Some kernelcore has been met, update counts and
> -                        * break if the kernelcore for this node has been
> -                        * satisfied
> -                        */
> -                       required_kernelcore -= min(required_kernelcore,
> -                                                               size_pages);
> -                       kernelcore_remaining -= size_pages;
> -                       if (!kernelcore_remaining)
> -                               break;
> -               }
> +               find_virt_zone(occupied, &pfn_of(i, 0));
> +               occupied += nr_pages_of(i);
>         }
> -
> -       /*
> -        * If there is still required_kernelcore, we do another pass with one
> -        * less node in the count. This will push zone_movable_pfn[nid] further
> -        * along on the nodes that still have memory until kernelcore is
> -        * satisfied
> -        */
> -       usable_nodes--;
> -       if (usable_nodes && required_kernelcore > usable_nodes)
> -               goto restart;
> -
>  out2:
> -       /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
> +       /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */
>         for (nid = 0; nid < MAX_NUMNODES; nid++) {
>                 unsigned long start_pfn, end_pfn;
> -
> -               zone_movable_pfn[nid] =
> -                       roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
> +               unsigned long prev_virt_zone_pfn = 0;
>
>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> -               if (zone_movable_pfn[nid] >= end_pfn)
> -                       zone_movable_pfn[nid] = 0;
> +
> +               for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
> +                       pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES);
> +
> +                       if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn)
> +                               pfn_of(i, nid) = 0;
> +
> +                       if (pfn_of(i, nid))
> +                               prev_virt_zone_pfn = pfn_of(i, nid);
> +               }
>         }
> -
>  out:
>         /* restore the node_state */
>         node_states[N_MEMORY] = saved_node_state;
> @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone,
>  #endif
>
>  /*
> - * The zone ranges provided by the architecture do not include ZONE_MOVABLE
> - * because it is sized independent of architecture. Unlike the other zones,
> - * the starting point for ZONE_MOVABLE is not fixed. It may be different
> - * in each node depending on the size of each node and how evenly kernelcore
> - * is distributed. This helper function adjusts the zone ranges
> + * The zone ranges provided by the architecture do not include virtual zones
> + * because they are sized independent of architecture. Unlike physical zones,
> + * the starting point for the first populated virtual zone is not fixed. It may
> + * be different in each node depending on the size of each node and how evenly
> + * kernelcore is distributed. This helper function adjusts the zone ranges
>   * provided by the architecture for a given node by using the end of the
> - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that
> - * zones within a node are in order of monotonic increases memory addresses
> + * highest usable zone for the first populated virtual zone. This preserves the
> + * assumption that zones within a node are in order of monotonically
> + * increasing memory addresses.
>   */
> -static void __init adjust_zone_range_for_zone_movable(int nid,
> +static void __init adjust_zone_range(int nid,
>                                         unsigned long zone_type,
>                                         unsigned long node_end_pfn,
>                                         unsigned long *zone_start_pfn,
>                                         unsigned long *zone_end_pfn)
>  {
> -       /* Only adjust if ZONE_MOVABLE is on this node */
> -       if (zone_movable_pfn[nid]) {
> -               /* Size ZONE_MOVABLE */
> -               if (zone_type == ZONE_MOVABLE) {
> -                       *zone_start_pfn = zone_movable_pfn[nid];
> -                       *zone_end_pfn = min(node_end_pfn,
> -                               arch_zone_highest_possible_pfn[movable_zone]);
> +       int i = max_t(int, zone_type, LAST_PHYS_ZONE);
> +       unsigned long next_virt_zone_pfn = 0;
>
> -               /* Adjust for ZONE_MOVABLE starting within this range */
> -               } else if (!mirrored_kernelcore &&
> -                       *zone_start_pfn < zone_movable_pfn[nid] &&
> -                       *zone_end_pfn > zone_movable_pfn[nid]) {
> -                       *zone_end_pfn = zone_movable_pfn[nid];
> +       while (i++ < LAST_VIRT_ZONE) {
> +               if (pfn_of(i, nid)) {
> +                       next_virt_zone_pfn = pfn_of(i, nid);
> +                       break;
> +               }
> +       }
>
> -               /* Check if this whole range is within ZONE_MOVABLE */
> -               } else if (*zone_start_pfn >= zone_movable_pfn[nid])
> +       if (zone_type <= LAST_PHYS_ZONE) {
> +               if (!next_virt_zone_pfn)
> +                       return;
> +
> +               if (!mirrored_kernelcore &&
> +                   *zone_start_pfn < next_virt_zone_pfn &&
> +                   *zone_end_pfn > next_virt_zone_pfn)
> +                       *zone_end_pfn = next_virt_zone_pfn;
> +               else if (*zone_start_pfn >= next_virt_zone_pfn)
>                         *zone_start_pfn = *zone_end_pfn;
> +       } else if (zone_type <= LAST_VIRT_ZONE) {
> +               if (!pfn_of(zone_type, nid))
> +                       return;
> +
> +               if (next_virt_zone_pfn)
> +                       *zone_end_pfn = min3(next_virt_zone_pfn,
> +                                            node_end_pfn,
> +                                            arch_zone_highest_possible_pfn[virt_zone]);
> +               else
> +                       *zone_end_pfn = min(node_end_pfn,
> +                                           arch_zone_highest_possible_pfn[virt_zone]);
> +               *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid));
>         }
>  }
>
> @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
>          * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
>          * and vice versa.
>          */
> -       if (mirrored_kernelcore && zone_movable_pfn[nid]) {
> +       if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) {
>                 unsigned long start_pfn, end_pfn;
>                 struct memblock_region *r;
>
> @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
>         /* Get the start and end of the zone */
>         *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
>         *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
> -       adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn,
> -                                          zone_start_pfn, zone_end_pfn);
> +       adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn);
>
>         /* Check that this node has pages within the zone's required range */
>         if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
> @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
>  #if defined(CONFIG_MEMORY_HOTPLUG)
>                 zone->present_early_pages = real_size;
>  #endif
> +               if (i == ZONE_NOSPLIT)
> +                       zone->order = zone_nosplit_order;
> +               if (i == ZONE_NOMERGE)
> +                       zone->order = zone_nomerge_order;
>
>                 totalpages += spanned;
>                 realtotalpages += real_size;
> @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat)
>  {
>         enum zone_type zone_type;
>
> -       for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
> +       for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) {
>                 struct zone *zone = &pgdat->node_zones[zone_type];
>                 if (populated_zone(zone)) {
>                         if (IS_ENABLED(CONFIG_HIGHMEM))
> @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void)
>  void __init free_area_init(unsigned long *max_zone_pfn)
>  {
>         unsigned long start_pfn, end_pfn;
> -       int i, nid, zone;
> +       int i, j, nid, zone;
>         bool descending;
>
>         /* Record where the zone boundaries are */
> @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>         start_pfn = PHYS_PFN(memblock_start_of_DRAM());
>         descending = arch_has_descending_max_zone_pfns();
>
> -       for (i = 0; i < MAX_NR_ZONES; i++) {
> +       for (i = 0; i <= LAST_PHYS_ZONE; i++) {
>                 if (descending)
> -                       zone = MAX_NR_ZONES - i - 1;
> +                       zone = LAST_PHYS_ZONE - i;
>                 else
>                         zone = i;
>
> -               if (zone == ZONE_MOVABLE)
> -                       continue;
> -
>                 end_pfn = max(max_zone_pfn[zone], start_pfn);
>                 arch_zone_lowest_possible_pfn[zone] = start_pfn;
>                 arch_zone_highest_possible_pfn[zone] = end_pfn;
> @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>                 start_pfn = end_pfn;
>         }
>
> -       /* Find the PFNs that ZONE_MOVABLE begins at in each node */
> -       memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
> -       find_zone_movable_pfns_for_nodes();
> +       /* Find the PFNs that virtual zones begin at in each node */
> +       find_virt_zones();
>
>         /* Print out the zone ranges */
>         pr_info("Zone ranges:\n");
> -       for (i = 0; i < MAX_NR_ZONES; i++) {
> -               if (i == ZONE_MOVABLE)
> -                       continue;
> +       for (i = 0; i <= LAST_PHYS_ZONE; i++) {
>                 pr_info("  %-8s ", zone_names[i]);
>                 if (arch_zone_lowest_possible_pfn[i] ==
>                                 arch_zone_highest_possible_pfn[i])
> @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>                                         << PAGE_SHIFT) - 1);
>         }
>
> -       /* Print out the PFNs ZONE_MOVABLE begins at in each node */
> -       pr_info("Movable zone start for each node\n");
> -       for (i = 0; i < MAX_NUMNODES; i++) {
> -               if (zone_movable_pfn[i])
> -                       pr_info("  Node %d: %#018Lx\n", i,
> -                              (u64)zone_movable_pfn[i] << PAGE_SHIFT);
> +       /* Print out the PFNs virtual zones begin at in each node */
> +       for (; i <= LAST_VIRT_ZONE; i++) {
> +               pr_info("%s zone start for each node\n", zone_names[i]);
> +               for (j = 0; j < MAX_NUMNODES; j++) {
> +                       if (pfn_of(i, j))
> +                               pr_info("  Node %d: %#018Lx\n",
> +                                       j, (u64)pfn_of(i, j) << PAGE_SHIFT);
> +               }
>         }
>
>         /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 150d4f23b010..6a4da8f8691c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = {
>          "HighMem",
>  #endif
>          "Movable",
> +        "NoSplit",
> +        "NoMerge",
>  #ifdef CONFIG_ZONE_DEVICE
>          "Device",
>  #endif
> @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1;
>  static int watermark_boost_factor __read_mostly = 15000;
>  static int watermark_scale_factor = 10;
>
> -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> -int movable_zone;
> -EXPORT_SYMBOL(movable_zone);
> +/* virt_zone is the "real" zone pages in virtual zones are taken from */
> +int virt_zone;
> +EXPORT_SYMBOL(virt_zone);
>
>  #if MAX_NUMNODES > 1
>  unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
> @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>         unsigned long higher_page_pfn;
>         struct page *higher_page;
>
> -       if (order >= MAX_PAGE_ORDER - 1)
> -               return false;
> -
>         higher_page_pfn = buddy_pfn & pfn;
>         higher_page = page + (higher_page_pfn - pfn);
>
> @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>                         NULL) != NULL;
>  }
>
> +static int zone_max_order(struct zone *zone)
> +{
> +       return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER;
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page,
>         unsigned long combined_pfn;
>         struct page *buddy;
>         bool to_tail;
> +       int max_order = zone_max_order(zone);
>
>         VM_BUG_ON(!zone_is_initialized(zone));
>         VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page,
>         VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>         VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> -       while (order < MAX_PAGE_ORDER) {
> +       while (order < max_order) {
>                 if (compaction_capture(capc, page, order, migratetype)) {
>                         __mod_zone_freepage_state(zone, -(1 << order),
>                                                                 migratetype);
> @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page,
>                 to_tail = true;
>         else if (is_shuffle_order(order))
>                 to_tail = shuffle_pick_tail();
> +       else if (order + 1 >= max_order)
> +               to_tail = false;
>         else
>                 to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
> @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page,
>         int mt;
>         int ret = 0;
>
> +       VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page);
> +
>         if (split_pfn_offset == 0)
>                 return ret;
>
> @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>         struct free_area *area;
>         struct page *page;
>
> +       VM_WARN_ON_ONCE(!zone_is_suitable(zone, order));
> +
>         /* Find a page of the appropriate size in the preferred list */
>         for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
>                 area = &(zone->free_area[current_order]);
> @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>         long min = mark;
>         int o;
>
> +       if (!zone_is_suitable(z, order))
> +               return false;
> +
>         /* free_pages may go negative - that's OK */
>         free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>
> @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  {
>         long free_pages;
>
> +       if (!zone_is_suitable(z, order))
> +               return false;
> +
>         free_pages = zone_page_state(z, NR_FREE_PAGES);
>
>         /*
> @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>                 struct page *page;
>                 unsigned long mark;
>
> +               if (!zone_is_suitable(zone, order))
> +                       continue;
> +
>                 if (cpusets_enabled() &&
>                         (alloc_flags & ALLOC_CPUSET) &&
>                         !__cpuset_zone_allowed(zone, gfp_mask))
> @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void)
>         struct zone *zone;
>         unsigned long flags;
>
> -       /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
> +       /* Calculate total number of pages below ZONE_HIGHMEM */
>         for_each_zone(zone) {
> -               if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
> +               if (zone_idx(zone) <= ZONE_NORMAL)
>                         lowmem_pages += zone_managed_pages(zone);
>         }
>
> @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void)
>                 spin_lock_irqsave(&zone->lock, flags);
>                 tmp = (u64)pages_min * zone_managed_pages(zone);
>                 do_div(tmp, lowmem_pages);
> -               if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
> +               if (zone_idx(zone) > ZONE_NORMAL) {
>                         /*
>                          * __GFP_HIGH and PF_MEMALLOC allocations usually don't
> -                        * need highmem and movable zones pages, so cap pages_min
> -                        * to a small  value here.
> +                        * need pages from zones above ZONE_NORMAL, so cap
> +                        * pages_min to a small value here.
>                          *
>                          * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
>                          * deltas control async page reclaim, and so should
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index cd0ea3668253..8a6473543427 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>                  * pages then it should be reasonably safe to assume the rest
>                  * is movable.
>                  */
> -               if (zone_idx(zone) == ZONE_MOVABLE)
> +               if (zid_is_virt(zone_idx(zone)))
>                         continue;
>
>                 /*
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..ad0db0373b05 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>         entry.val = 0;
>
>         if (folio_test_large(folio)) {
> -               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() &&
> +                   folio_test_pmd_mappable(folio))
>                         get_swap_pages(1, &entry, folio_nr_pages(folio));
>                 goto out;
>         }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..ae061ec4866a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         goto keep_locked;
>                                 if (folio_maybe_dma_pinned(folio))
>                                         goto keep_locked;
> -                               if (folio_test_large(folio)) {
> -                                       /* cannot split folio, skip it */
> -                                       if (!can_split_folio(folio, NULL))
> -                                               goto activate_locked;
> -                                       /*
> -                                        * Split folios without a PMD map right
> -                                        * away. Chances are some or all of the
> -                                        * tail pages can be freed without IO.
> -                                        */
> -                                       if (!folio_entire_mapcount(folio) &&
> -                                           split_folio_to_list(folio,
> -                                                               folio_list))
> -                                               goto activate_locked;
> -                               }
> +                               /*
> +                                 * Split folios that are not fully mapped right
> +                                * away. Chances are some of the tail pages can
> +                                * be freed without IO.
> +                                */
> +                               if (folio_test_large(folio) &&
> +                                   atomic_read(&folio->_nr_pages_mapped) < nr_pages)
> +                                       split_folio_to_list(folio, folio_list);
>                                 if (!add_to_swap(folio)) {
>                                         if (!folio_test_large(folio))
>                                                 goto activate_locked_split;
> @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>         orig_mask = sc->gfp_mask;
>         if (buffer_heads_over_limit) {
>                 sc->gfp_mask |= __GFP_HIGHMEM;
> -               sc->reclaim_idx = gfp_zone(sc->gfp_mask);
> +               sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order);
>         }
>
>         for_each_zone_zonelist_nodemask(zone, z, zonelist,
> @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>         struct scan_control sc = {
>                 .nr_to_reclaim = SWAP_CLUSTER_MAX,
>                 .gfp_mask = current_gfp_context(gfp_mask),
> -               .reclaim_idx = gfp_zone(gfp_mask),
> +               .reclaim_idx = gfp_order_zone(gfp_mask, order),
>                 .order = order,
>                 .nodemask = nodemask,
>                 .priority = DEF_PRIORITY,
> @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
>         if (!cpuset_zone_allowed(zone, gfp_flags))
>                 return;
>
> +       curr_idx = gfp_order_zone(gfp_flags, order);
> +       if (highest_zoneidx > curr_idx)
> +               highest_zoneidx = curr_idx;
> +
>         pgdat = zone->zone_pgdat;
>         curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);
>
> @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>                 .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
>                 .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
>                 .may_swap = 1,
> -               .reclaim_idx = gfp_zone(gfp_mask),
> +               .reclaim_idx = gfp_order_zone(gfp_mask, order),
>         };
>         unsigned long pflags;
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index db79935e4a54..adbd032e6a0f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
>
>  #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
>                                         TEXT_FOR_HIGHMEM(xx) xx "_movable", \
> +                                       xx "_nosplit", xx "_nomerge", \
>                                         TEXT_FOR_DEVICE(xx)
>
>  const char * const vmstat_text[] = {
> @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu"
> -                  "\n        cma      %lu",
> +                  "\n        cma      %lu"
> +                  "\n        order    %u",
>                    zone_page_state(zone, NR_FREE_PAGES),
>                    zone->watermark_boost,
>                    min_wmark_pages(zone),
> @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone_managed_pages(zone),
> -                  zone_cma_pages(zone));
> +                  zone_cma_pages(zone),
> +                  zone->order);
>
>         seq_printf(m,
>                    "\n        protection: (%ld",
> --
> 2.44.0.rc1.240.g4c46232300-goog
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter Three] THP HVO: bring the hugeTLB feature to THP
  2024-02-29 22:54   ` Yang Shi
@ 2024-03-01 15:42     ` David Hildenbrand
  2024-03-03  1:46     ` Yu Zhao
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-03-01 15:42 UTC (permalink / raw)
  To: Yang Shi, Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On 29.02.24 23:54, Yang Shi wrote:
> On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> HVO can be one of the perks for heavy THP users like it is for hugeTLB
>> users. For example, if such a user uses 60% of physical memory for 2MB
>> THPs, THP HVO can reduce the struct page overhead by half (60% * 7/8
>> ~= 50%).
>>
>> ZONE_NOMERGE considerably simplifies the implementation of HVO for
>> THPs, since THPs from it cannot be split or merged and thus do not
>> require any correctness-related operations on tail pages beyond the
>> second one.
>>
>> If a THP is mapped by PTEs, two optimization-related operations on its
>> tail pages, i.e., _mapcount and PG_anon_exclusive, can be binned to
>> track a group of pages, e.g., eight pages per group for 2MB THPs. The
>> estimation, as the copying cost incurred during shattering, is also by
>> design, since mapping by PTEs is another discouraged behavior.
> 
> I'm confused by this. Can you please elaborate a little bit about
> binning mapcount and PG_anon_exclusive?
> 
> For mapcount, IIUC, for example, when inc'ing a subpage's mapcount,
> you actually inc the (i % 64) page's mapcount (assuming THP size is 2M
> and base page size is 4K, so 8 strides and 64 pages in each stride),
> right? But how you can tell each page of the 8 pages has mapcount 1 or
> one page is mapped 8 times? Or this actually doesn't matter, we don't
> even care to distinguish the two cases?

I'm hoping we won't need such elaborate approaches that make the 
mapcounts even more complicated in the future.

Just like for hugetlb HGM (if it ever becomes real), I'm hoping that we 
can just avoid subpage mapcounts completely, at least in some kernel 
configs initially.

I was looking into having only a single PAE bit this week, but 
migration+swapout are (again) giving me a really hard time. In theory 
it's simple; the corner cases are killing me.

What I really dislike about PAE right now is not necessarily the space, 
but that they reside in multiple cachelines and that we have to use 
atomic operations to set/clear them simply because other page flags 
might be set concurrently. PAE can only be set/cleared while holding the 
page table lock already, so I really want to avoid atomics.

I have not given up on a single PAE bit per folio, but the alternative I 
was thinking about this week was simply allocating the space required 
for maintaining them and storing a pointer to that in the (anon) folio. 
Not perfect.
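
A very rough userspace sketch of that alternative, purely for
illustration (the struct and helper names below are made up): one PAE
bit per subpage in a separately allocated bitmap, written with plain
stores since PAE is only touched under the page table lock.

#include <stdlib.h>

struct anon_folio_sketch {
	unsigned int nr_pages;
	unsigned long *pae_bits;	/* separately allocated; accessed only
					 * under the page table lock, so plain
					 * (non-atomic) bit operations suffice */
};

static int folio_alloc_pae_bits(struct anon_folio_sketch *folio)
{
	size_t n = (folio->nr_pages + 63) / 64;

	folio->pae_bits = calloc(n, sizeof(unsigned long));
	return folio->pae_bits ? 0 : -1;
}

static void folio_set_anon_exclusive(struct anon_folio_sketch *folio,
				     unsigned int i)
{
	folio->pae_bits[i / 64] |= 1UL << (i % 64);	/* no atomics needed */
}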

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter Two] THP shattering: the reverse of collapsing
  2024-02-29 21:55   ` Zi Yan
@ 2024-03-03  1:17     ` Yu Zhao
  2024-03-03  1:21       ` Zi Yan
  0 siblings, 1 reply; 28+ messages in thread
From: Yu Zhao @ 2024-03-03  1:17 UTC (permalink / raw)
  To: Zi Yan; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 4:55 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Feb 2024, at 13:34, Yu Zhao wrote:
>
> > In contrast to split, shatter migrates occupied pages in a partially
> > mapped THP to a bunch of base folios. IOW, unlike split done in place,
> > shatter is the exact opposite of collapse.
> >
> > The advantage of shattering is that it keeps the original THP intact.
>
> Why keep the THP intact? To prevent the THP from fragmentation, since
> the shattered part will not be returned to buddy allocator for reuse?

There might be a confusion here: there is no "shattered part" -- the
entire THP becomes free after shattering (the occupied part is moved
to a bunch of 4KB pages).

> I agree with the idea of shattering, but keeping THP intact might
> give us trouble for 1GB THP case when PMD mapping is created after
> shattering. How to update mapcount for a PMD mapping in the middle of
> a 1GB folio? I used head[0], head[512], ... as the PMD mapping head
> page, but that is ugly. For mTHPs, there is no such problem since
> only PTE mappings are involved.

If we don't consider the copying cost during shattering, it can work
for 1GB THPs as it does for 2MB THPs.

> It might be better to just split the THP and move free pages to a
> donot-use free list until the rest are freed too

The main reason we do shattering is, using a crude analogy, a million
dollars in $10,000 bills (yes, they exist) is worth a lot more than
that in pennies. You can carry the former in your pocket but the
latter weighs at least 250 tons. So if we split, we lose money.

1GB THP is one of the important *end goals* for TAO. But I don't want
to go into details since we need to focus on the first few steps at
the current stage.

The problem with shattering for 1GB is the copying cost -- if we
shatter a 1GB THP half mapped/unmapped, we'd have to copy 512MB data,
which is unacceptable. 1GB THP requires something we call "THP
fungibility" (see the epilogue) -- we do split in place, but we also
"collapse" in place (called THP recovery, i.e., MADV_RECOVERY).

Shattering is for 2MB THPs only.
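
For a concrete feel of the copy cost, here is a tiny standalone sketch
(illustrative numbers only; half-mapped THPs are assumed):

#include <stdio.h>

/* Worst-case bytes copied by shattering: every still-mapped 4KB subpage
 * is migrated to a base folio so the THP itself can be freed intact. */
static unsigned long shatter_copy_bytes(unsigned long mapped_subpages)
{
	return mapped_subpages * 4096UL;
}

int main(void)
{
	/* half-mapped 2MB THP: 256 subpages -> 1MB copied, tolerable */
	printf("2MB THP: %lu KB copied\n", shatter_copy_bytes(256) >> 10);
	/* half-mapped 1GB THP: 131072 subpages -> 512MB copied, hence
	 * in-place split plus THP recovery instead of shattering */
	printf("1GB THP: %lu MB copied\n", shatter_copy_bytes(131072) >> 20);
	return 0;
}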


> if the zone enforces
> a minimal order that is larger than the free pages.
>
> > The cost of copying during the migration is not a side effect, but
> > rather by design, since splitting is considered a discouraged
> > behavior. In retail terms, the return of a purchase is charged with a
> > restocking fee and the original goods can be resold.
> >
> > THPs from ZONE_NOMERGE can only be shattered, since they cannot be
> > split or merged. THPs from ZONE_NOSPLIT can be shattered or split (the
> > latter requires [1]), if they are above the minimum order.
> >
> > [1] https://lore.kernel.org/20240226205534.1603748-1-zi.yan@sent.com/
> >
>
>
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter Two] THP shattering: the reverse of collapsing
  2024-03-03  1:17     ` Yu Zhao
@ 2024-03-03  1:21       ` Zi Yan
  0 siblings, 0 replies; 28+ messages in thread
From: Zi Yan @ 2024-03-03  1:21 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

[-- Attachment #1: Type: text/plain, Size: 3075 bytes --]

On 2 Mar 2024, at 20:17, Yu Zhao wrote:

> On Thu, Feb 29, 2024 at 4:55 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 29 Feb 2024, at 13:34, Yu Zhao wrote:
>>
>>> In contrast to split, shatter migrates occupied pages in a partially
>>> mapped THP to a bunch of base folios. IOW, unlike split done in place,
>>> shatter is the exact opposite of collapse.
>>>
>>> The advantage of shattering is that it keeps the original THP intact.
>>
>> Why keep the THP intact? To prevent the THP from fragmentation, since
>> the shattered part will not be returned to buddy allocator for reuse?
>
> There might be a confusion here: there is no "shattered part" -- the
> entire THP becomes free after shattering (the occupied part is moved
> to a bunch of 4KB pages).

Got it. Now it makes more sense. Thanks.

>
>> I agree with the idea of shattering, but keeping THP intact might
>> give us trouble for 1GB THP case when PMD mapping is created after
>> shattering. How to update mapcount for a PMD mapping in the middle of
>> a 1GB folio? I used head[0], head[512], ... as the PMD mapping head
>> page, but that is ugly. For mTHPs, there is no such problem since
>> only PTE mappings are involved.
>
> If we don't consider the copying cost during shattering, it can work
> for 1GB THPs as it does for 2MB THPs.
>
>> It might be better to just split the THP and move free pages to a
>> donot-use free list until the rest are freed too
>
> The main reason we do shattering is, using a crude analogy, a million
> dollars in $10,000 bills (yes, they exist) is worth a lot more than
> that in pennies. You can carry the former in your pocket but the
> latter weighs at least 250 tons. So if we split, we lose money.
>
> 1GB THP is one of the important *end goals* for TAO. But I don't want
> to go into details since we need to focus on the first few steps at
> the current stage.
>
> The problem with shattering for 1GB is the copying cost -- if we
> shatter a 1GB THP half mapped/unmapped, we'd have to copy 512MB data,
> which is unacceptable. 1GB THP requires something we call "THP
> fungibility" (see the epilogue) -- we do split in place, but we also
> "collapse" in place (called THP recovery, i.e., MADV_RECOVERY).
>
> Shattering is for 2MB THPs only.

Got it. Thanks for the explanation.

>
>
>> if the zone enforces
>> a minimal order that is larger than the free pages.
>>
>>> The cost of copying during the migration is not a side effect, but
>>> rather by design, since splitting is considered a discouraged
>>> behavior. In retail terms, the return of a purchase is charged with a
>>> restocking fee and the original goods can be resold.
>>>
>>> THPs from ZONE_NOMERGE can only be shattered, since they cannot be
>>> split or merged. THPs from ZONE_NOSPLIT can be shattered or split (the
>>> latter requires [1]), if they are above the minimum order.
>>>
>>> [1] https://lore.kernel.org/20240226205534.1603748-1-zi.yan@sent.com/
>>>
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter Three] THP HVO: bring the hugeTLB feature to THP
  2024-02-29 22:54   ` Yang Shi
  2024-03-01 15:42     ` David Hildenbrand
@ 2024-03-03  1:46     ` Yu Zhao
  1 sibling, 0 replies; 28+ messages in thread
From: Yu Zhao @ 2024-03-03  1:46 UTC (permalink / raw)
  To: Yang Shi; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 5:54 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > HVO can be one of the perks for heavy THP users like it is for hugeTLB
> > users. For example, if such a user uses 60% of physical memory for 2MB
> > THPs, THP HVO can reduce the struct page overhead by half (60% * 7/8
> > ~= 50%).
> >
> > ZONE_NOMERGE considerably simplifies the implementation of HVO for
> > THPs, since THPs from it cannot be split or merged and thus do not
> > require any correctness-related operations on tail pages beyond the
> > second one.
> >
> > If a THP is mapped by PTEs, two optimization-related operations on its
> > tail pages, i.e., _mapcount and PG_anon_exclusive, can be binned to
> > track a group of pages, e.g., eight pages per group for 2MB THPs. The
> > estimation, as the copying cost incurred during shattering, is also by
> > design, since mapping by PTEs is another discouraged behavior.
>
> I'm confused by this. Can you please elaborate a little bit about
> binning mapcount and PG_anon_exclusive?
>
> For mapcount, IIUC, for example, when inc'ing a subpage's mapcount,
> you actually inc the (i % 64) page's mapcount (assuming THP size is 2M
> and base page size is 4K, so 8 strides and 64 pages in each stride),
> right?

Correct.

> But how you can tell each page of the 8 pages has mapcount 1 or
> one page is mapped 8 times?

We can't :)

> Or this actually doesn't matter, we don't
> even care to distinguish the two cases?

Exactly.

> For PG_anon_exclusive, if one page has it set, it means other 7 pages
> in other strides have it set too?

Correct. We leverage the fact that they (_mapcount and
PG_anon_exclusive) are optimizations: overestimating _mapcount and
underestimating PG_anon_exclusive (both erring toward the worst case)
can only affect the performance of PTE-mapped THPs (as a punishment
for splitting).
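
To make the binning concrete, here is a minimal sketch under the
assumptions above (2MB THP, 4KB base pages, HVO keeping 1/8 of the
vmemmap); the macro and helper names are made up for illustration:

#define THP_NR_PAGES	512			/* 2MB / 4KB */
#define NR_KEPT_PAGES	(THP_NR_PAGES / 8)	/* HVO frees 7/8 of the struct pages */

/*
 * Subpages i, i + 64, i + 128, ... alias the one struct page that HVO
 * keeps, so their _mapcount (and PG_anon_exclusive) updates land on the
 * same counter -- overestimating _mapcount and underestimating
 * PG_anon_exclusive, which only costs performance for PTE-mapped THPs.
 */
static inline int thp_hvo_bin(int subpage_index)
{
	return subpage_index % NR_KEPT_PAGES;	/* 64 bins, 8 subpages each */
}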


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 23:31   ` Yang Shi
@ 2024-03-03  2:47     ` Yu Zhao
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Zhao @ 2024-03-03  2:47 UTC (permalink / raw)
  To: Yang Shi; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 6:31 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > There are three types of zones:
> > 1. The first four zones partition the physical address space of CPU
> >    memory.
> > 2. The device zone provides interoperability between CPU and device
> >    memory.
> > 3. The movable zone commonly represents a memory allocation policy.
> >
> > Though originally designed for memory hot removal, the movable zone is
> > instead widely used for other purposes, e.g., CMA and kdump kernel, on
> > platforms that do not support hot removal, e.g., Android and ChromeOS.
> > Nowadays, it is legitimately a zone independent of any physical
> > characteristics. In spite of being somewhat regarded as a hack,
> > largely due to the lack of a generic design concept for its true major
> > use cases (on billions of client devices), the movable zone naturally
> > resembles a policy (virtual) zone overlayed on the first four
> > (physical) zones.
> >
> > This proposal formally generalizes this concept as policy zones so
> > that additional policies can be implemented and enforced by subsequent
> > zones after the movable zone. An inherited requirement of policy zones
> > (and the first four zones) is that subsequent zones must be able to
> > fall back to previous zones and therefore must add new properties to
> > the previous zones rather than remove existing ones from them. Also,
> > all properties must be known at the allocation time, rather than the
> > runtime, e.g., memory object size and mobility are valid properties
> > but hotness and lifetime are not.
> >
> > ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> > zones:
> > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> >    ZONE_MOVABLE) and restricted to a minimum order to be
> >    anti-fragmentation. The latter means that they cannot be split down
> >    below that order, while they are free or in use.
> > 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> >    to an exact order. The latter means that not only is split
> >    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> >    reason in Chapter Three), while they are free or in use.
> >
> > Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> > __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> > compaction is not needed for these two zones.
> >
> > Compared with the hugeTLB pool approach, THP zones tap into core MM
> > features including:
> > 1. THP allocations can fall back to the lower zones, which can have
> >    higher latency but still succeed.
> > 2. THPs can be either shattered (see Chapter Two) if partially
> >    unmapped or reclaimed if becoming cold.
> > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> >    contiguous PTEs on arm64 [1], which are more suitable for client
> >    workloads.
>
> I think the allocation fallback policy needs to be elaborated. IIUC,
> when allocating large folios, if the order > min order of the policy
> zones, the fallback policy should be ZONE_NOSPLIT/NOMERGE ->
> ZONE_MOVABLE -> ZONE_NORMAL, right?

Correct.

> If all other zones are depleted, the allocation, whose order is < the
> min order, won't fallback to the policy zones and will fail, just like
> the non-movable allocation can't fallback to ZONE_MOVABLE even though
> there is enough memory for that zone, right?

Correct. In this case, the userspace can consider dynamic resizing.
(The resizing patches are not included since, as I said in the other
thread, we need to focus on the first few steps at the current stage.)
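
A minimal sketch of the suitability rule implied by the two answers
above (the types and names below are illustrative, not the posted
zone_is_suitable()):

#include <stdbool.h>

enum policy_zone_type { PHYSICAL, MOVABLE, NOSPLIT, NOMERGE };

struct zone_policy {
	enum policy_zone_type type;
	unsigned int order;	/* minimum order (NOSPLIT) or exact order (NOMERGE) */
};

/*
 * Fallback walks from ZONE_NOMERGE/ZONE_NOSPLIT down through ZONE_MOVABLE
 * to the physical zones; an allocation below a THP zone's order skips
 * that zone, so it can fail even while the THP zones still have free memory.
 */
static bool zone_suitable(const struct zone_policy *z, unsigned int order)
{
	switch (z->type) {
	case NOMERGE:
		return order == z->order;
	case NOSPLIT:
		return order >= z->order;
	default:
		return true;
	}
}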

Naturally, the next question would be why create this whole new
process rather than trying to improve compaction. We did try the
latter: on servers, we tuned compaction and had some good improvements
but soon hit a new wall; on clients, no luck at all, because 1) they
are usually under much higher pressure than servers and 2) they are
more sensitive to latency.

So we needed a *more deterministic* approach when dealing with
fragmentation. Unlike compaction, which I'd call a heuristic, resizing
is more of a policy that the userspace can have full control over.
Obviously leaving the task to the userspace can be a good or bad
thing, depending on the point of view.

The bottom line is:
1. The resizing would also help the *existing* imbalance between
ZONE_MOVABLE and the other zones, for the non-hot-removal case.
2. Enlarging the THP zones would be more likely to succeed than
compaction would, because it targets the blocks it "donated" to
ZONE_MOVABLE with everything it has (both migration and reclaim) and
it keeps at it until it succeeds, whereas compaction lacks such
laser focus and is more of a best-effort approach.

(Needless to say, shrinking the THP zones can always succeed.)


> > Policy zones can be dynamically resized by offlining pages in one of
> > them and onlining those pages in another of them. Note that this is
> > only done among policy zones, not between a policy zone and a physical
> > zone, since resizing is a (software) policy, not a physical
> > characteristic.
> >
> > Implementing the same idea in the pageblock granularity has also been
> > explored but rejected at Google. Pageblocks have a finer granularity
> > and therefore can be more flexible than zones. The tradeoff is that
> > this alternative implementation was more complex and failed to bring a
> > better ROI. However, the rejection was mainly due to its inability to
> > be smoothly extended to 1GB THPs [2], which is a planned use case of
> > TAO.
> >
> > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
> > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
  2024-02-29 20:28   ` Matthew Wilcox
  2024-02-29 23:31   ` Yang Shi
@ 2024-03-04 15:19   ` Matthew Wilcox
  2024-03-05 17:22     ` Matthew Wilcox
  2024-03-05  8:41   ` Barry Song
  2024-05-24  8:38   ` Barry Song
  4 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2024-03-04 15:19 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
>    ZONE_MOVABLE) and restricted to a minimum order to be
>    anti-fragmentation. The latter means that they cannot be split down
>    below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
>    to an exact order. The latter means that not only is split
>    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
>    reason in Chapter Three), while they are free or in use.

These two zones end up solving a problem for memdescs.  So I'm in favour!
I added Option 5 to https://kernelnewbies.org/MatthewWilcox/BuddyAllocator

I think this patch needs to be split into more digestible chunks, but a
quick skim of it didn't reveal anything egregiously wrong.  I do still
have that question about the number of bits used for Zone in
page->flags.  Probably this all needs to be dependent on CONFIG_64BIT?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
  2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
                   ` (3 preceding siblings ...)
  2024-02-29 18:34 ` [Epilogue] Profile-Guided Heap Optimization and THP fungibility Yu Zhao
@ 2024-03-05  8:37 ` Barry Song
  2024-03-06 15:51 ` Johannes Weiner
       [not found] ` <CAOUHufZE87n_rR9EdjHLkyOt7T+Bc3a9g1Kct6cnGHXo26TDcQ@mail.gmail.com>
  6 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2024-03-05  8:37 UTC (permalink / raw)
  To: yuzhao; +Cc: corbet, linux-mm, lsf-pc

> TAO is an umbrella project aiming at a better economy of physical
> contiguity viewed as a valuable resource. A few examples are:
> 1. A multi-tenant system can have guaranteed THP coverage while
>    hosting abusers/misusers of the resource.
> 2. Abusers/misusers, e.g., workloads excessively requesting and then
>    splitting THPs, should be punished if necessary.
> 3. Good citizens should be awarded with, e.g., lower allocation
>    latency and less cost of metadata (struct page).

I think TAO or a similar optimization in the buddy allocator is
essential to the success of mTHP.

Ryan's recent mTHP work can bring multi-size large folios to a wide
range of products for which THP might be too large.

But a pain point is that the buddy allocator on a real device with
limited memory can be seriously fragmented after it runs for some time.

We (OPPO) have actually brought up mTHP-like features on millions of
phones, even on the 5.4, 5.10, 5.15 and 6.1 kernels, with 64KiB large
folios to leverage ARM64's CONT-PTE. The open source code for kernel
6.1 is available here [1]. We found the success rate of 64KiB
allocations could be very low after running monkey [2] on phones for
one hour.
 
After the phone has been running for one hour, below is the data we
collected from 60 to 120 minutes (the second hour). Without a TAO-like
optimization to the existing buddy allocator, 64KiB large folio
allocations fall back to small folios at a rate of 92.35% in
do_anonymous_page().

thp_do_anon_pages_fallback / (thp_do_anon_pages + thp_do_anon_pages_fallback)
25807330 / 27944523 =  0.9235

In do_anonymous_page(), thp_do_anon_pages_fallback is the number of
times we try to allocate 64KiB but fail and thus use small folios
instead; thp_do_anon_pages is the number of times we try to allocate
64KiB and succeed.

So this number means mTHP loses the vast majority of its value on a
fragmented system, and fragmentation is always the reality on a phone.

This has actually pushed us to implement a similar optimization to
avoid splitting 64KiB folios and to reward 64KiB allocations with lower
latency. Our implementation is different from TAO: rather than adding
new zones, we add migration types to mark some pageblocks as dedicated
to mTHP allocations, and we avoid splitting them into lower orders
except for some corner cases. This has significantly improved our
success rate for 64KiB large folio allocations, decreased the latency,
and helped large folios finally be applied in real products.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/
[2] https://developer.android.com/studio/test/other-testing-tools/monkey

> 4. Better interoperability with userspace memory allocators when
>    transacting the resource.
> 
> This project puts the same emphasis on the established use case for
> servers and the emerging use case for clients so that client workloads
> like Android and ChromeOS can leverage the recent multi-sized THPs
> [1][2].

> Chapter One introduces the cornerstone of TAO: an abstraction called
> policy (virtual) zones, which are overlayed on the physical zones.
> This is in line with item 1 above.
> 
> A new door is open after Chapter One. The following two chapters
> discuss the reverse of THP collapsing, called THP shattering, and THP
> HVO, which brings the hugeTLB feature [3] to THP. They are in line
> with items 2 & 3 above.
> 
> Advanced use cases are discussed in Epilogue, since they require the
> cooperation of userspace memory allocators. This is in line with item
> 4 above.
> 
> [1] https://lwn.net/Articles/932386/
> [2] https://lwn.net/Articles/937239/
> [3] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> 
> Yu Zhao (4):
>   THP zones: the use cases of policy zones
>   THP shattering: the reverse of collapsing
>   THP HVO: bring the hugeTLB feature to THP
>   Profile-Guided Heap Optimization and THP fungibility

Thanks
Barry



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
                     ` (2 preceding siblings ...)
  2024-03-04 15:19   ` Matthew Wilcox
@ 2024-03-05  8:41   ` Barry Song
  2024-03-05 10:07     ` Vlastimil Babka
  2024-05-24  8:38   ` Barry Song
  4 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-05  8:41 UTC (permalink / raw)
  To: yuzhao; +Cc: corbet, linux-mm, lsf-pc

> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
>    memory.
> 2. The device zone provides interoperability between CPU and device
>    memory.
> 3. The movable zone commonly represents a memory allocation policy.
> 
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
> 
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
> 
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
>    ZONE_MOVABLE) and restricted to a minimum order to be
>    anti-fragmentation. The latter means that they cannot be split down
>    below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
>    to an exact order. The latter means that not only is split
>    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
>    reason in Chapter Three), while they are free or in use.
> 
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
> 
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
>    higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
>    unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
>    contiguous PTEs on arm64 [1], which are more suitable for client
>    workloads.
> 
> Policy zones can be dynamically resized by offlining pages in one of
> them and onlining those pages in another of them. Note that this is
> only done among policy zones, not between a policy zone and a physical
> zone, since resizing is a (software) policy, not a physical
> characteristic.
> 
> Implementing the same idea in the pageblock granularity has also been
> explored but rejected at Google. Pageblocks have a finer granularity
> and therefore can be more flexible than zones. The tradeoff is that
> this alternative implementation was more complex and failed to bring a
> better ROI. However, the rejection was mainly due to its inability to
> be smoothly extended to 1GB THPs [2], which is a planned use case of
> TAO.

We did implement a similar idea at the pageblock granularity on OPPO's
phones by adding two special migratetypes [1]:

* QUAD_TO_TRIP - mainly for order-4 mTHP allocations, which can use
ARM64's CONT-PTE; it can rarely be split into order 3 to dull the pain
of order-3 allocations, and only if an order-3 allocation has failed in
both the normal buddy and the TRIP_TO_QUAD type below.

* TRIP_TO_QUAD - also mainly for order-4 mTHP allocations, which can
use ARM64's CONT-PTE; it can sometimes be split into order 3 to dull
the pain of order-3 allocations, but only if an order-3 allocation has
failed in the normal buddy.

Neither of the above will be merged into order 5 or above; neither
will be split into order 2 or lower (see the rough sketch below).

In compaction, we skip both of the above. One disadvantage I am seeing
with this approach is that I have to add a separate LRU list in each
zone to place those mTHP folios; if mTHP and small folios are put in
the same LRU list, the reclamation efficiency is extremely bad.

A separate zone, on the other hand, can avoid a dedicated LRU list for
mTHP, since the new zone has its own LRU list.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c
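
A rough sketch of the split rules above, assuming the two migratetypes
from the linked tree (the enum and helper here are illustrative only):

#include <stdbool.h>

enum mthp_mt { MT_TRIP_TO_QUAD, MT_QUAD_TO_TRIP };

/*
 * May a dedicated pageblock be split down to order 3, given how far the
 * order-3 request has already fallen back?
 *   TRIP_TO_QUAD: only after the normal buddy has failed;
 *   QUAD_TO_TRIP: only after both the normal buddy and TRIP_TO_QUAD failed.
 * Neither type is ever split below order 3 or merged above order 4.
 */
static bool may_split_for_order3(enum mthp_mt mt,
				 bool buddy_failed, bool trip_to_quad_failed)
{
	if (mt == MT_TRIP_TO_QUAD)
		return buddy_failed;
	return buddy_failed && trip_to_quad_failed;	/* MT_QUAD_TO_TRIP */
}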

> 
> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>

Thanks
Barry



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-03-05  8:41   ` Barry Song
@ 2024-03-05 10:07     ` Vlastimil Babka
  2024-03-05 21:04       ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2024-03-05 10:07 UTC (permalink / raw)
  To: Barry Song, yuzhao; +Cc: corbet, linux-mm, lsf-pc

On 3/5/24 09:41, Barry Song wrote:
> We did implement similar idea in the pageblock granularity on OPPO's
> phones by extending two special migratetypes[1]:
> 
> * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use
> ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain
> of 3-order allocation if and only if 3-order allocation has failed in both
> normal buddy and the below TRIP_TO_QUAD.
>   
> * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use
> ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the
> pain of 3-order allocation if and only if 3-order allocation has failed in
> normal buddy.
> 
> neither of above will be merged into 5 order or above; neither of above
> will be splitted into 2 order or lower.
> 
> in compaction, we will skip both of above. I am seeing one disadvantage
> of this approach is that I have to add a separate LRU list in each
> zone to place those mTHP folios. if mTHP and small folios are put
> in the same LRU list, the reclamation efficiency is extremely bad.
> 
> A separate zone, on the other hand, can avoid a separate LRU list
> for mTHP as the new zone has its own LRU list.

But we switched from per-zone to per-node LRU lists years ago?
Is that actually a complication for the policy zones? Or does this work
silently assume multigen lru which (IIRC) works differently?


> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c
> 
>>
>> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
> 
> Thanks
> Barry
> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-03-04 15:19   ` Matthew Wilcox
@ 2024-03-05 17:22     ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2024-03-05 17:22 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Mon, Mar 04, 2024 at 03:19:42PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> > ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> > zones:
> > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> >    ZONE_MOVABLE) and restricted to a minimum order to be
> >    anti-fragmentation. The latter means that they cannot be split down
> >    below that order, while they are free or in use.
> > 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> >    to an exact order. The latter means that not only is split
> >    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> >    reason in Chapter Three), while they are free or in use.
> 
> These two zones end up solving a problem for memdescs.  So I'm in favour!
> I added Option 5 to https://kernelnewbies.org/MatthewWilcox/BuddyAllocator

I realised that we don't even need a doubly linked list for ZONE_NOMERGE
(would ZONE_FIXEDSIZE be a better name?).  We only need a doubly linked
list to make removal from the middle of the list an O(1) operation, and
we only remove from the middle of a list when merging.  So we can simply
keep a stack of free "pages", and we have 60 bits to point to the next
memdesc, so we can easily cover all memory that can exist in a 64-bit
machine in ZONE_NOMERGE.  ZONE_NOSPLIT would be limited to the first 1PB
of memory (assuming it has a minimum size of 2MB -- with 29 bits to
refer to each of next & prev, 29 + 21 = 50 bits of address space).
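Something like this, as a toy sketch (the field layout and helpers are
purely illustrative):

#include <stdint.h>

/* Toy memdesc: 4 bits left over, 60 bits to index the next free one. */
struct memdesc {
        uint64_t flags     :  4;
        uint64_t next_free : 60;  /* 0 terminates the free stack */
};

static struct memdesc *memdescs;  /* one per fixed-size block in the zone */
static uint64_t free_top;         /* index of the top of the stack, 0 if empty */

static void push_free(uint64_t idx)
{
        memdescs[idx].next_free = free_top;
        free_top = idx;
}

static uint64_t pop_free(void)
{
        uint64_t idx = free_top;

        if (idx)
                free_top = memdescs[idx].next_free;
        return idx;  /* 0 means no free blocks left */
}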


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-03-05 10:07     ` Vlastimil Babka
@ 2024-03-05 21:04       ` Barry Song
  2024-03-06  3:05         ` Yu Zhao
  0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-05 21:04 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: yuzhao, corbet, linux-mm, lsf-pc

On Tue, Mar 5, 2024 at 11:07 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/5/24 09:41, Barry Song wrote:
> > We did implement similar idea in the pageblock granularity on OPPO's
> > phones by extending two special migratetypes[1]:
> >
> > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use
> > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain
> > of 3-order allocation if and only if 3-order allocation has failed in both
> > normal buddy and the below TRIP_TO_QUAD.
> >
> > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use
> > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the
> > pain of 3-order allocation if and only if 3-order allocation has failed in
> > normal buddy.
> >
> > neither of above will be merged into 5 order or above; neither of above
> > will be splitted into 2 order or lower.
> >
> > in compaction, we will skip both of above. I am seeing one disadvantage
> > of this approach is that I have to add a separate LRU list in each
> > zone to place those mTHP folios. if mTHP and small folios are put
> > in the same LRU list, the reclamation efficiency is extremely bad.
> >
> > A separate zone, on the other hand, can avoid a separate LRU list
> > for mTHP as the new zone has its own LRU list.
>
> But we switched from per-zone to per-node LRU lists years ago?
> Is that actually a complication for the policy zones? Or does this work
> silently assume multigen lru which (IIRC) works differently?

The latter. Based on the code below, I believe MGLRU differs from
active/inactive:

void lru_gen_init_lruvec(struct lruvec *lruvec)
{
        int i;
        int gen, type, zone;
        struct lru_gen_folio *lrugen = &lruvec->lrugen;

        lrugen->max_seq = MIN_NR_GENS + 1;
        lrugen->enabled = lru_gen_enabled();

        for (i = 0; i <= MIN_NR_GENS + 1; i++)
                lrugen->timestamps[i] = jiffies;

        for_each_gen_type_zone(gen, type, zone)
                INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]);

        lruvec->mm_state.seq = MIN_NR_GENS;
}

A fundamental difference is that MGLRU has its own aging and eviction
mechanism, which can keep the LRUs of each zone moving forward at the
same pace, while active/inactive might be unable to compare the ages of
folios across zones.

>
>
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c
> >
> >>
> >> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >

Thanks
Barry


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-03-05 21:04       ` Barry Song
@ 2024-03-06  3:05         ` Yu Zhao
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Zhao @ 2024-03-06  3:05 UTC (permalink / raw)
  To: Barry Song; +Cc: Vlastimil Babka, corbet, linux-mm, lsf-pc

On Tue, Mar 5, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 5, 2024 at 11:07 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 3/5/24 09:41, Barry Song wrote:
> > > We did implement similar idea in the pageblock granularity on OPPO's
> > > phones by extending two special migratetypes[1]:
> > >
> > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use
> > > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain
> > > of 3-order allocation if and only if 3-order allocation has failed in both
> > > normal buddy and the below TRIP_TO_QUAD.
> > >
> > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use
> > > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the
> > > pain of 3-order allocation if and only if 3-order allocation has failed in
> > > normal buddy.
> > >
> > > neither of above will be merged into 5 order or above; neither of above
> > > will be splitted into 2 order or lower.
> > >
> > > in compaction, we will skip both of above. I am seeing one disadvantage
> > > of this approach is that I have to add a separate LRU list in each
> > > zone to place those mTHP folios. if mTHP and small folios are put
> > > in the same LRU list, the reclamation efficiency is extremely bad.
> > >
> > > A separate zone, on the other hand, can avoid a separate LRU list
> > > for mTHP as the new zone has its own LRU list.
> >
> > But we switched from per-zone to per-node LRU lists years ago?
> > Is that actually a complication for the policy zones? Or does this work
> > silently assume multigen lru which (IIRC) works differently?
>
> the latter. based on the below code, i believe mglru is different
> with active/inactive,
>
> void lru_gen_init_lruvec(struct lruvec *lruvec)
> {
>         int i;
>         int gen, type, zone;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
>         lrugen->max_seq = MIN_NR_GENS + 1;
>         lrugen->enabled = lru_gen_enabled();
>
>         for (i = 0; i <= MIN_NR_GENS + 1; i++)
>                 lrugen->timestamps[i] = jiffies;
>
>         for_each_gen_type_zone(gen, type, zone)
>                 INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]);
>
>         lruvec->mm_state.seq = MIN_NR_GENS;
> }
>
> A fundamental difference is that mglru has a different aging and
> eviction mechanism,
> This can synchronize the LRUs of each zone to move forward at the same
> pace while
> the active/inactive might be unable to compare the ages of folios across zones.

That's correct. The active/inactive LRU should also work with the extra
zones, just like it does for ZONE_MOVABLE. But it's not as optimized as
MGLRU, which, e.g., can target eligible zones without searching an
entire LRU list containing folios from all zones.
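E.g., with per-zone generation lists, eviction for an allocation only
needs to walk the sublists at or below its highest eligible zone index.
A simplified standalone model (not the actual MGLRU code; the names
mirror the snippet above but everything here is illustrative):

#define MAX_NR_GENS     4
#define ANON_AND_FILE   2
#define MAX_NR_ZONES    8

struct list_head { struct list_head *next, *prev; };

/* Simplified stand-in for struct lru_gen_folio: one list per (gen, type, zone). */
struct lru_gen_model {
        struct list_head folios[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
};

static void scan_folios_in_list(struct list_head *list)
{
        (void)list;  /* stand-in for the real eviction work on one sublist */
}

/*
 * Reclaim driven by an allocation whose highest usable zone index is
 * high_zoneidx walks only the matching per-zone sublists; folios in
 * ineligible zones are never visited.
 */
static void evict_for_zoneidx(struct lru_gen_model *lrugen, int gen, int type,
                              int high_zoneidx)
{
        int zone;

        for (zone = 0; zone <= high_zoneidx; zone++)
                scan_folios_in_list(&lrugen->folios[gen][type][zone]);
}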


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 20:28   ` Matthew Wilcox
@ 2024-03-06  3:51     ` Yu Zhao
  2024-03-06  4:33       ` Matthew Wilcox
  0 siblings, 1 reply; 28+ messages in thread
From: Yu Zhao @ 2024-03-06  3:51 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Thu, Feb 29, 2024 at 3:28 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> > Compared with the hugeTLB pool approach, THP zones tap into core MM
> > features including:
> > 1. THP allocations can fall back to the lower zones, which can have
> >    higher latency but still succeed.
> > 2. THPs can be either shattered (see Chapter Two) if partially
> >    unmapped or reclaimed if becoming cold.
> > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> >    contiguous PTEs on arm64 [1], which are more suitable for client
> >    workloads.
>
> Can this mechanism be used to fully replace the hugetlb pool approach?
> That would be a major selling point.  It kind of feels like it should,
> but I am insufficiently expert to be certain.

This depends on the return value from htlb_alloc_mask(): if it's
GFP_HIGHUSER_MOVABLE, then yes (i.e., 2MB hugeTLB folios on x86).
Hypothetically, if users can have THPs that are as reliable as what
hugeTLB offers, wouldn't most of them want to go with the former, since
it's more flexible? E.g., it keeps core MM features like split
(shattering) and reclaim, in addition to HVO.

> I'll read over the patches sometime soon.  There's a lot to go through.
> Something I didn't see in the cover letter or commit messages was any
> discussion of page->flags and how many bits we use for ZONE (particularly
> on 32-bit).  Perhaps I'll discover the answer to that as I read.

There may be corner cases because of how different architectures use
page->flags, but in general, this shouldn't be a big problem: we can
have at most 6 zones before this series and at most 8 after it. IOW, we
need 3 bits regardless, in order to encode all existing zones.
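A quick back-of-the-envelope check of that bit count (a standalone toy,
not how the kernel actually computes its page->flags layout at build
time):

#include <stdio.h>

/* Bits needed in page->flags to encode nr_zones zones: ceil(log2(nr_zones)). */
static int zone_bits(int nr_zones)
{
        int bits = 0;

        while ((1 << bits) < nr_zones)
                bits++;
        return bits;
}

int main(void)
{
        printf("6 zones -> %d bits\n", zone_bits(6)); /* 3, before this series */
        printf("8 zones -> %d bits\n", zone_bits(8)); /* 3, after this series */
        return 0;
}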


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-03-06  3:51     ` Yu Zhao
@ 2024-03-06  4:33       ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2024-03-06  4:33 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet

On Tue, Mar 05, 2024 at 10:51:20PM -0500, Yu Zhao wrote:
> On Thu, Feb 29, 2024 at 3:28 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> > > Compared with the hugeTLB pool approach, THP zones tap into core MM
> > > features including:
> > > 1. THP allocations can fall back to the lower zones, which can have
> > >    higher latency but still succeed.
> > > 2. THPs can be either shattered (see Chapter Two) if partially
> > >    unmapped or reclaimed if becoming cold.
> > > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> > >    contiguous PTEs on arm64 [1], which are more suitable for client
> > >    workloads.
> >
> > Can this mechanism be used to fully replace the hugetlb pool approach?
> > That would be a major selling point.  It kind of feels like it should,
> > but I am insufficiently expert to be certain.
> 
> This depends on the return value from htlb_alloc_mask(): if it's
> GFP_HIGHUSER_MOVABLE, then yes (i.e., 2MB hugeTLB folios on x86).
> Hypothetically, if users can have THPs as reliable as hugeTLB can
> offer, wouldn't most users want to go with the former since it's more
> flexible? E.g., core MM features like split (shattering) and reclaim
> in addition to HVO.

Right; the real question is what can we do to unify hugetlbfs and THPs.
The reservation ability is one feature that hugetlbfs has over THP and
removing that advantage gets us one step closer.

> > I'll read over the patches sometime soon.  There's a lot to go through.
> > Something I didn't see in the cover letter or commit messages was any
> > discussion of page->flags and how many bits we use for ZONE (particularly
> > on 32-bit).  Perhaps I'll discover the answer to that as I read.
> 
> There may be corner cases because of how different architectures use
> page->flags, but in general, this shouldn't be a big problem because
> we can have 6 zones (at most) before this series, and after this
> series, we can have 8 (at most). IOW, we need 3 bits regardless, in
> order to all existing zones.

On a 32-bit system, we'll typically only have four: DMA, NORMAL, HIGHMEM
and MOVABLE.  DMA32 will be skipped since it would match NORMAL, and
DEVICE is just not supported on 32-bit.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
  2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
                   ` (4 preceding siblings ...)
  2024-03-05  8:37 ` [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Barry Song
@ 2024-03-06 15:51 ` Johannes Weiner
  2024-03-06 16:40   ` Zi Yan
  2024-03-13 22:09   ` Kaiyang Zhao
       [not found] ` <CAOUHufZE87n_rR9EdjHLkyOt7T+Bc3a9g1Kct6cnGHXo26TDcQ@mail.gmail.com>
  6 siblings, 2 replies; 28+ messages in thread
From: Johannes Weiner @ 2024-03-06 15:51 UTC (permalink / raw)
  To: Yu Zhao; +Cc: lsf-pc, linux-mm, Jonathan Corbet, Kaiyang Zhao

On Thu, Feb 29, 2024 at 11:34:32AM -0700, Yu Zhao wrote:
> TAO is an umbrella project aiming at a better economy of physical
> contiguity viewed as a valuable resource. A few examples are:
> 1. A multi-tenant system can have guaranteed THP coverage while
>    hosting abusers/misusers of the resource.
> 2. Abusers/misusers, e.g., workloads excessively requesting and then
>    splitting THPs, should be punished if necessary.
> 3. Good citizens should be awarded with, e.g., lower allocation
>    latency and less cost of metadata (struct page).
> 4. Better interoperability with userspace memory allocators when
>    transacting the resource.
> 
> This project puts the same emphasis on the established use case for
> servers and the emerging use case for clients so that client workloads
> like Android and ChromeOS can leverage the recent multi-sized THPs
> [1][2].
> 
> Chapter One introduces the cornerstone of TAO: an abstraction called
> policy (virtual) zones, which are overlayed on the physical zones.
> This is in line with item 1 above.

This is a very interesting topic to me. Meta has collaborated with CMU
to research this as well, the results of which are typed up here:
https://dl.acm.org/doi/pdf/10.1145/3579371.3589079

We had used a dynamic CMA region, but unless I'm missing something
about the policy zones this is just another way to skin the cat.

The other difference is that we made the policy about migratetypes
rather than order. The advantage of doing it by order is of course
that you can forego a lot of compaction work to begin with. The
downside is that you have to be more precise and proactive about
sizing the THP vs non-THP regions correctly, as it's more restrictive
than saying "this region just has to remain compactable, but is good
for small and large pages" - most workloads will have a mix of those.

For region sizing, I see that for now you have boot parameters. But
the exact composition of orders that a system needs is going to vary
by workload, and likely within workloads over time. IMO some form of
auto-sizing inside the kernel will make the difference between this
being a general-purpose OS feature and "this is useful to hyperscalers
that control their whole stack, have resources to profile their
applications in-depth, and can tailor-make kernel policies around the
results" - not unlike hugetlb itself.

What we had experimented with is a feedback system between the
regions. It tracks the amount of memory pressure that exists for the
pages in each region - i.e. how much reclaim and compaction is needed
to satisfy allocations from a given region, and how many refaults and
swapins are occurring in them - and then moves the boundaries
accordingly if there is an imbalance.

The first draft of this was an extension to psi to track pressure by
allocation context. This worked quite well, but was a little fat on
the scheduler cacheline footprint. Kaiyang (CC'd) has been working on
tracking these input metrics in a leaner fashion.
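Roughly, the loop we had in mind looks like this (names and thresholds
are invented for illustration; this is not the actual implementation):

/* Per-region signals sampled over an interval. */
struct region_pressure {
        unsigned long alloc_stalls;  /* reclaim/compaction work needed to serve allocations */
        unsigned long refaults;
        unsigned long swapins;
};

static unsigned long pressure_score(const struct region_pressure *p)
{
        return p->alloc_stalls + p->refaults + p->swapins;
}

/*
 * Return how many pages to move across the boundary: positive grows the
 * THP region, negative shrinks it, zero leaves it alone.
 */
static long rebalance_step(const struct region_pressure *thp,
                           const struct region_pressure *small,
                           unsigned long threshold, long step_pages)
{
        unsigned long st = pressure_score(thp);
        unsigned long ss = pressure_score(small);

        if (st > ss + threshold)
                return step_pages;
        if (ss > st + threshold)
                return -step_pages;
        return 0;
}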

You mentioned a pageblock-oriented solution also in Chapter One. I had
proposed one before, so I'm obviously biased, but my gut feeling is
that we likely need both - one for 2MB and smaller, and one for
1GB. My thinking is this:

1. Contiguous zones are more difficult and less reliable to resize at
   runtime, and the huge page size you're trying to grow and shrink
   the regions for matters. Assuming 4k pages (wild, I know) there are
   512 pages in a 2MB folio, but a quarter million pages in a 1GB
   folio. It's much easier for a single die-hard kernel allocation to
   get in the way of expanding the THP region by another 1GB page than
   finding 512 disjoint 2MB pageblocks somewhere.

   Basically, dynamic adaptiveness of the pool seems necessary for a
   general-purpose THP[tm] feature, but I also think adaptiveness for 1G
   huge pages is going to be difficult to pull off reliably, simply
   because we have no control over the lifetime of kernel allocations.

2. I think there also remains a difference in audience. Reliable
   coverage of up to 2MB would be a huge boon for most workloads,
   especially the majority of those that are not optimized much for
   contiguity. IIRC Willy mentioned somewhere before that nowadays the
   optimal average page size is still in the multi-k range.

   1G huge pages are immensely useful for specific loads - we
   certainly have our share of those as well. But the step size to 1GB
   is so large that:

   1) fewer applications can benefit in the first place

   2) it requires applications to participate more proactively in the
      contiguity efforts to keep internal fragmentation reasonable

   3) the 1G huge pages are more expensive and less reliable when it
      comes to growing the THP region by another page at runtime,
      which remains a forcing function for static, boot-time configs

   4) the performance impact of falling back from 1G to 2MB or 4k
      would be quite large compared to falling back from 2M. Setups
      that invest to overcome all of the above difficulties in order
      to tickle more cycles out of their systems are going to be less
      tolerant of just falling back to smaller pages

   As you can see, points 2-4 take a lot of the "transparent" out of
   "transparent huge pages".

So it might be best to do both, and have each one do their thing well.

Anyway, I think this would be a super interesting and important
discussion to have at LSFMM. Thanks for proposing this.

I would like to be part of it, and would also suggest to have Kaiyang
(CC'd) in the room, who is the primary researcher on the Contiguitas
paper referenced above.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
  2024-03-06 15:51 ` Johannes Weiner
@ 2024-03-06 16:40   ` Zi Yan
  2024-03-13 22:09   ` Kaiyang Zhao
  1 sibling, 0 replies; 28+ messages in thread
From: Zi Yan @ 2024-03-06 16:40 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Yu Zhao, lsf-pc, linux-mm, Jonathan Corbet, Kaiyang Zhao

[-- Attachment #1: Type: text/plain, Size: 6411 bytes --]

On 6 Mar 2024, at 10:51, Johannes Weiner wrote:

> On Thu, Feb 29, 2024 at 11:34:32AM -0700, Yu Zhao wrote:
>> TAO is an umbrella project aiming at a better economy of physical
>> contiguity viewed as a valuable resource. A few examples are:
>> 1. A multi-tenant system can have guaranteed THP coverage while
>>    hosting abusers/misusers of the resource.
>> 2. Abusers/misusers, e.g., workloads excessively requesting and then
>>    splitting THPs, should be punished if necessary.
>> 3. Good citizens should be awarded with, e.g., lower allocation
>>    latency and less cost of metadata (struct page).
>> 4. Better interoperability with userspace memory allocators when
>>    transacting the resource.
>>
>> This project puts the same emphasis on the established use case for
>> servers and the emerging use case for clients so that client workloads
>> like Android and ChromeOS can leverage the recent multi-sized THPs
>> [1][2].
>>
>> Chapter One introduces the cornerstone of TAO: an abstraction called
>> policy (virtual) zones, which are overlayed on the physical zones.
>> This is in line with item 1 above.
>
> This is a very interesting topic to me. Meta has collaborated with CMU
> to research this as well, the results of which are typed up here:
> https://dl.acm.org/doi/pdf/10.1145/3579371.3589079
>
> We had used a dynamic CMA region, but unless I'm missing something
> about the policy zones this is just another way to skin the cat.
>
> The other difference is that we made the policy about migratetypes
> rather than order. The advantage of doing it by order is of course
> that you can forego a lot of compaction work to begin with. The
> downside is that you have to be more precise and proactive about
> sizing the THP vs non-THP regions correctly, as it's more restrictive
> than saying "this region just has to remain compactable, but is good
> for small and large pages" - most workloads will have a mix of those.
>
> For region sizing, I see that for now you have boot parameters. But
> the exact composition of orders that a system needs is going to vary
> by workload, and likely within workloads over time. IMO some form of
> auto-sizing inside the kernel will make the difference between this
> being a general-purpose OS feature and "this is useful to hyperscalers
> that control their whole stack, have resources to profile their
> applications in-depth, and can tailor-make kernel policies around the
> results" - not unlike hugetlb itself.
>
> What we had experimented with is a feedback system between the
> regions. It tracks the amount of memory pressure that exists for the
> pages in each section - i.e. how much reclaim and compaction is needed
> to satisfy allocations from a given region, and how many refaults and
> swapins are occuring in them - and then move the boundaries
> accordingly if there is an imbalance.
>
> The first draft of this was an extension to psi to track pressure by
> allocation context. This worked quite well, but was a little fat on
> the scheduler cacheline footprint. Kaiyang (CC'd) has been working on
> tracking these input metrics in a leaner fashion.
>
> You mentioned a pageblock-oriented solution also in Chapter One. I had
> proposed one before, so I'm obviously biased, but my gut feeling is
> that we likely need both - one for 2MB and smaller, and one for
> 1GB. My thinking is this:
>
> 1. Contiguous zones are more difficult and less reliable to resize at
>    runtime, and the huge page size you're trying to grow and shrink
>    the regions for matters. Assuming 4k pages (wild, I know) there are
>    512 pages in a 2MB folio, but a quarter million pages in a 1GB
>    folio. It's much easier for a single die-hard kernel allocation to
>    get in the way of expanding the THP region by another 1GB page than
>    finding 512 disjunct 2MB pageblocks somewhere.
>
>    Basically, dynamic adaptiveness of the pool seems necessary for a
>    general-purpose THP[tm] feature, but also think adaptiveness for 1G
>    huge pages is going to be difficult to pull off reliably, simply
>    because we have no control over the lifetime of kernel allocations.
>
> 2. I think there also remains a difference in audience. Reliable
>    coverage of up to 2MB would be a huge boon for most workloads,
>    especially the majority of those that are not optimized much for
>    contiguity. IIRC Willy mentioned before somewhere that nowdays the
>    optimal average page size is still in the multi-k range.
>
>    1G huge pages are immensely useful for specific loads - we
>    certainly have our share of those as well. But the step size to 1GB
>    is so large that:
>
>    1) it's fewer applications that can benefit in the first place
>
>    2) it requires applications to participate more proactively in the
>       contiguity efforts to keep internal fragmentation reasonable
>
>    3) the 1G huge pages are more expensive and less reliable when it
>       comes to growing the THP region by another page at runtime,
>       which remains a forcing function for static, boot-time configs
>
>    4) the performance impact of falling back from 1G to 2MB or 4k
>       would be quite large compared to falling back from 2M. Setups
>       that invest to overcome all of the above difficulties in order
>       to tickle more cycles out of their systems are going to be less
>       tolerant of just falling back to smaller pages
>
>    As you can see, points 2-4 take a lot of the "transparent" out of
>    "transparent huge pages".

There are also implementation challenges for 1GB THPs, based on my past
experience:

1) I had triple mapping (PTE, PMD, PUD) support for 1GB THP in my
   original patchset, but the implementation was quite hacky and
   complicated. Subpage mapcount is going to be a headache to maintain.
   We probably want to not support triple mapping.

2) Page migration was not in my patchset due to high migration overheads,
   although the implementation might not be hard. At the very least,
   splitting a 1GB THP upon migration should be added to make it movable;
   otherwise, it might cause performance issues on NUMA systems.

3) Creating a 1GB THP at page fault time might cause long latency. When
   to create one and who can create it will need to be discussed;
   khugepaged and process_madvise are candidates.

So we are more likely to end up with 1GB large folios without much of the
"transparent" feature.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
  2024-03-06 15:51 ` Johannes Weiner
  2024-03-06 16:40   ` Zi Yan
@ 2024-03-13 22:09   ` Kaiyang Zhao
  1 sibling, 0 replies; 28+ messages in thread
From: Kaiyang Zhao @ 2024-03-13 22:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yu Zhao, lsf-pc, linux-mm, Jonathan Corbet, kaiyang2, dskarlat

On Wed, Mar 06, 2024 at 10:51:10AM -0500, Johannes Weiner wrote:
> This is a very interesting topic to me. Meta has collaborated with CMU
> to research this as well, the results of which are typed up here:
> https://dl.acm.org/doi/pdf/10.1145/3579371.3589079
> 
> 
> I would like to be part of it, and would also suggest to have Kaiyang
> (CC'd) in the room, who is the primary researcher on the Contiguitas
> paper referenced above.

Thanks for bringing up Contiguitas, Johannes. Providing a large amount
of physical memory contiguity and managing it as a first-class resource
is very important for bringing a lot of research in virtual memory into
reality.

Johannes has already touched upon many parts of the kernel changes we
made in the Contiguitas project. To summarize, we want to confine
unmovable allocations to a separate region of the physical address
space, so that memory contiguity can later be recovered reliably through
compaction, and to make the unmovable region dynamically sizable so it
adapts to changing workload characteristics and avoids static sizing.

I will send an RFC with patches soon. Dimitrios (cc’d) and I are
interested in joining this effort and finding the best approach to
achieve our shared goal of more and easier-to-manage physical
contiguity.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations
       [not found] ` <CAOUHufZE87n_rR9EdjHLkyOt7T+Bc3a9g1Kct6cnGHXo26TDcQ@mail.gmail.com>
@ 2024-05-15 21:52   ` Yu Zhao
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Zhao @ 2024-05-15 21:52 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 4378 bytes --]

On Wed, May 15, 2024 at 3:17 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Feb 29, 2024 at 11:34 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > TAO is an umbrella project aiming at a better economy of physical
> > contiguity viewed as a valuable resource. A few examples are:
> > 1. A multi-tenant system can have guaranteed THP coverage while
> >    hosting abusers/misusers of the resource.
> > 2. Abusers/misusers, e.g., workloads excessively requesting and then
> >    splitting THPs, should be punished if necessary.
> > 3. Good citizens should be awarded with, e.g., lower allocation
> >    latency and less cost of metadata (struct page).
> > 4. Better interoperability with userspace memory allocators when
> >    transacting the resource.
> >
> > This project puts the same emphasis on the established use case for
> > servers and the emerging use case for clients so that client workloads
> > like Android and ChromeOS can leverage the recent multi-sized THPs
> > [1][2].
> >
> > Chapter One introduces the cornerstone of TAO: an abstraction called
> > policy (virtual) zones, which are overlayed on the physical zones.
> > This is in line with item 1 above.
> >
> > A new door is open after Chapter One. The following two chapters
> > discuss the reverse of THP collapsing, called THP shattering, and THP
> > HVO, which brings the hugeTLB feature [3] to THP. They are in line
> > with items 2 & 3 above.
> >
> > Advanced use cases are discussed in Epilogue, since they require the
> > cooperation of userspace memory allocators. This is in line with item
> > 4 above.
> >
> > [1] https://lwn.net/Articles/932386/
> > [2] https://lwn.net/Articles/937239/
> > [3] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> >
> > Yu Zhao (4):
> >   THP zones: the use cases of policy zones
> >   THP shattering: the reverse of collapsing
> >   THP HVO: bring the hugeTLB feature to THP
> >   Profile-Guided Heap Optimization and THP fungibility
> >
> >  .../admin-guide/kernel-parameters.txt         |  10 +
> >  drivers/virtio/virtio_mem.c                   |   2 +-
> >  include/linux/gfp.h                           |  24 +-
> >  include/linux/huge_mm.h                       |   6 -
> >  include/linux/memcontrol.h                    |   5 +
> >  include/linux/mempolicy.h                     |   2 +-
> >  include/linux/mm.h                            | 140 ++++++
> >  include/linux/mm_inline.h                     |  24 +
> >  include/linux/mm_types.h                      |   8 +-
> >  include/linux/mmzone.h                        |  53 +-
> >  include/linux/nodemask.h                      |   2 +-
> >  include/linux/rmap.h                          |   4 +
> >  include/linux/vm_event_item.h                 |   5 +-
> >  include/trace/events/mmflags.h                |   4 +-
> >  init/main.c                                   |   1 +
> >  mm/compaction.c                               |  12 +
> >  mm/gup.c                                      |   3 +-
> >  mm/huge_memory.c                              | 304 ++++++++++--
> >  mm/hugetlb_vmemmap.c                          |   2 +-
> >  mm/internal.h                                 |  47 +-
> >  mm/madvise.c                                  |  11 +-
> >  mm/memcontrol.c                               |  47 ++
> >  mm/memory-failure.c                           |   2 +-
> >  mm/memory.c                                   |  11 +-
> >  mm/mempolicy.c                                |  14 +-
> >  mm/migrate.c                                  |  51 +-
> >  mm/mm_init.c                                  | 452 ++++++++++--------
> >  mm/page_alloc.c                               | 199 +++++++-
> >  mm/page_isolation.c                           |   2 +-
> >  mm/rmap.c                                     |  21 +-
> >  mm/shmem.c                                    |   4 +-
> >  mm/swap_slots.c                               |   3 +-
> >  mm/truncate.c                                 |   6 +-
> >  mm/userfaultfd.c                              |   2 +-
> >  mm/vmscan.c                                   |  41 +-
> >  mm/vmstat.c                                   |  12 +-
> >  36 files changed, 1194 insertions(+), 342 deletions(-)
>
> Attaching the deck for this topic.

Let me compress the PDF and try again.

[-- Attachment #2: TAO (LSF_MM_BPF 2024).pdf --]
[-- Type: application/pdf, Size: 257591 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Chapter One] THP zones: the use cases of policy zones
  2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
                     ` (3 preceding siblings ...)
  2024-03-05  8:41   ` Barry Song
@ 2024-05-24  8:38   ` Barry Song
  4 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2024-05-24  8:38 UTC (permalink / raw)
  To: yuzhao; +Cc: corbet, linux-mm, lsf-pc

> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
>   memory.
> 2. The device zone provides interoperability between CPU and device
>    memory.
> 3. The movable zone commonly represents a memory allocation policy.
> 
> +
> +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn)
> +{
> +	int i, nid;
> +	unsigned long node_avg, remaining;

Hi Yu,

I discovered that CMA can be part of virtual zones. For example:

Node 0, zone  NoMerge
  pages free     35945
      nr_free_pages 35945
      ...
      nr_free_cma  8128
  pagesets

CMA used to be available for order-0 anonymous allocations, and the
Android kernel even prioritized it with commit [1],
"ANDROID: cma: redirect page allocation to CMA":

/*
 * Used during anonymous page fault handling.
 */
struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
						unsigned long vaddr)
{
	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_CMA;
	/*
	 * If the page is mapped with PROT_MTE, initialise the tags at the
	 * point of allocation and page zeroing as this is usually faster than
	 * separate DC ZVA and STGM.
	 */
	if (vma->vm_flags & VM_MTE)
		flags |= __GFP_ZEROTAGS;
	return vma_alloc_folio(flags, 0, vma, vaddr, false);
}

I wonder if CMA is still available to order-0 allocations when it is
located in the NoMerge/NoSplit zone.

And when dma_alloc_coherent() or similar APIs want to get contiguous
memory from CMA, is that still as easy as before if CMA is part of the
virtual zones?

[1] https://android.googlesource.com/kernel/common/+/1c8aebe4c072bf18409cc78fc84407e24a437302

Thanks
Barry


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-05-24  8:39 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-29 18:34 [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Yu Zhao
2024-02-29 18:34 ` [Chapter One] THP zones: the use cases of policy zones Yu Zhao
2024-02-29 20:28   ` Matthew Wilcox
2024-03-06  3:51     ` Yu Zhao
2024-03-06  4:33       ` Matthew Wilcox
2024-02-29 23:31   ` Yang Shi
2024-03-03  2:47     ` Yu Zhao
2024-03-04 15:19   ` Matthew Wilcox
2024-03-05 17:22     ` Matthew Wilcox
2024-03-05  8:41   ` Barry Song
2024-03-05 10:07     ` Vlastimil Babka
2024-03-05 21:04       ` Barry Song
2024-03-06  3:05         ` Yu Zhao
2024-05-24  8:38   ` Barry Song
2024-02-29 18:34 ` [Chapter Two] THP shattering: the reverse of collapsing Yu Zhao
2024-02-29 21:55   ` Zi Yan
2024-03-03  1:17     ` Yu Zhao
2024-03-03  1:21       ` Zi Yan
2024-02-29 18:34 ` [Chapter Three] THP HVO: bring the hugeTLB feature to THP Yu Zhao
2024-02-29 22:54   ` Yang Shi
2024-03-01 15:42     ` David Hildenbrand
2024-03-03  1:46     ` Yu Zhao
2024-02-29 18:34 ` [Epilogue] Profile-Guided Heap Optimization and THP fungibility Yu Zhao
2024-03-05  8:37 ` [LSF/MM/BPF TOPIC] TAO: THP Allocator Optimizations Barry Song
2024-03-06 15:51 ` Johannes Weiner
2024-03-06 16:40   ` Zi Yan
2024-03-13 22:09   ` Kaiyang Zhao
     [not found] ` <CAOUHufZE87n_rR9EdjHLkyOt7T+Bc3a9g1Kct6cnGHXo26TDcQ@mail.gmail.com>
2024-05-15 21:52   ` Yu Zhao
