* [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
@ 2023-09-06 11:26 Usama Arif
  2023-09-06 11:26 ` [v4 1/4] mm: hugetlb_vmemmap: Use nid of the head page to reallocate it Usama Arif
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Usama Arif @ 2023-09-06 11:26 UTC (permalink / raw)
  To: linux-mm, muchun.song, mike.kravetz, rppt
  Cc: linux-kernel, songmuchun, fam.zheng, liangma, punit.agrawal, Usama Arif

This series moves the boot time initialization of tail struct pages of a
gigantic page to later on in the boot. Only the
HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
are initialized at the start. If HVO (HugeTLB Vmemmap Optimization) is
successful, then no more tail struct pages need to be initialized. For a 1G
hugepage, this series avoids initialization of 262144 - 63 = 262081 struct
pages per hugepage.
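
To make the arithmetic concrete, here is a small standalone sketch of the
numbers above (assuming x86-64 with 4K pages and a 64-byte struct page, which
is what the 262144/63 figures imply; this program is illustrative only and is
not part of the series):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;		/* PAGE_SIZE */
	unsigned long struct_page_size = 64;	/* sizeof(struct page) */
	unsigned long huge_size = 1UL << 30;	/* 1G hugepage */

	/* struct pages backing one 1G hugepage */
	unsigned long total = huge_size / page_size;			/* 262144 */
	/* HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail pages */
	unsigned long early_tails = page_size / struct_page_size - 1;	/* 63 */

	/* tail struct pages whose early init is skipped, as quoted above */
	printf("%lu\n", total - early_tails);				/* 262081 */
	return 0;
}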

When tested on a 512G system (which can allocate a maximum of 500 1G
hugepages), the kexec boot time to running init, with HVO and
DEFERRED_STRUCT_PAGE_INIT enabled, is 3.9 seconds without this patch series
and 1.2 seconds with it.
This represents an approximately 70% reduction in boot time and will
significantly reduce server downtime when using a large number of
gigantic pages.

Thanks,
Usama

[v3->v4]:
- rebase on top of the patch "hugetlb: set hugetlb page flag before optimizing vmemmap".
- freeze head struct page ref count.
- Change order of operations to initialize head struct page -> initialize
the necessary tail struct pages -> attempt HVO -> initialize the rest of the
tail struct pages if HVO fails.
- (Mike Rapoport and Muchun Song) remove "_vmemmap" suffix from memblock reserve
noinit flags and functions.

[v2->v3]:
- (Muchun Song) skip prep of struct pages backing gigantic hugepages
at boot time only.
- (Muchun Song) move initialization of tail struct pages to after
HVO is attempted.

[v1->v2]:
- (Mike Rapoport) Code quality improvements (function names, arguments,
comments).

[RFC->v1]:
- (Mike Rapoport) Change from passing hugepage_size in
memblock_alloc_try_nid_raw for skipping struct page initialization to
using the MEMBLOCK_RSRV_NOINIT flag.

Usama Arif (4):
  mm: hugetlb_vmemmap: Use nid of the head page to reallocate it
  memblock: pass memblock_type to memblock_setclr_flag
  memblock: introduce MEMBLOCK_RSRV_NOINIT flag
  mm: hugetlb: Skip initialization of gigantic tail struct pages if
    freed by HVO

 include/linux/memblock.h |  9 ++++++
 mm/hugetlb.c             | 61 ++++++++++++++++++++++++++++++++++------
 mm/hugetlb_vmemmap.c     |  4 +--
 mm/hugetlb_vmemmap.h     |  9 +++---
 mm/internal.h            |  3 ++
 mm/memblock.c            | 48 ++++++++++++++++++++++---------
 mm/mm_init.c             |  2 +-
 7 files changed, 107 insertions(+), 29 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [v4 1/4] mm: hugetlb_vmemmap: Use nid of the head page to reallocate it
  2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
@ 2023-09-06 11:26 ` Usama Arif
  2023-09-06 11:26 ` [v4 2/4] memblock: pass memblock_type to memblock_setclr_flag Usama Arif
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2023-09-06 11:26 UTC (permalink / raw)
  To: linux-mm, muchun.song, mike.kravetz, rppt
  Cc: linux-kernel, songmuchun, fam.zheng, liangma, punit.agrawal, Usama Arif

If tail page prep and initialization are skipped, then the "start"
page will not contain the correct nid. Use the nid from the first
vmemmap page.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index aeb7dd889eee..3cdb38d87a95 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -319,7 +319,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
 	};
-	int nid = page_to_nid((struct page *)start);
+	int nid = page_to_nid((struct page *)reuse);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
 			__GFP_NOWARN;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [v4 2/4] memblock: pass memblock_type to memblock_setclr_flag
  2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
  2023-09-06 11:26 ` [v4 1/4] mm: hugetlb_vmemmap: Use nid of the head page to reallocate it Usama Arif
@ 2023-09-06 11:26 ` Usama Arif
  2023-09-06 11:26 ` [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag Usama Arif
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2023-09-06 11:26 UTC (permalink / raw)
  To: linux-mm, muchun.song, mike.kravetz, rppt
  Cc: linux-kernel, songmuchun, fam.zheng, liangma, punit.agrawal, Usama Arif

This allows setting flags on both memblock types and is in preparation for
setting flags (e.g. to not initialize struct pages) on reserved memory
regions.
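
A minimal before/after sketch of the call convention (illustrative only;
memblock_setclr_flag() remains static to mm/memblock.c, and the
MEMBLOCK_RSRV_NOINIT user only arrives in the next patch):

	/* Before: the flag helper always operated on memblock.memory */
	memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);

	/*
	 * After: the caller passes the memblock type explicitly, so reserved
	 * regions can be flagged too, e.g. (next patch):
	 */
	memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_HOTPLUG);
	memblock_setclr_flag(&memblock.reserved, base, size, 1, MEMBLOCK_RSRV_NOINIT);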

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memblock.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 913b2520a9a0..a49efbaee7e0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -901,10 +901,9 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
  *
  * Return: 0 on success, -errno on failure.
  */
-static int __init_memblock memblock_setclr_flag(phys_addr_t base,
-				phys_addr_t size, int set, int flag)
+static int __init_memblock memblock_setclr_flag(struct memblock_type *type,
+				phys_addr_t base, phys_addr_t size, int set, int flag)
 {
-	struct memblock_type *type = &memblock.memory;
 	int i, ret, start_rgn, end_rgn;
 
 	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
@@ -933,7 +932,7 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
  */
 int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
+	return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_HOTPLUG);
 }
 
 /**
@@ -945,7 +944,7 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
  */
 int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG);
+	return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_HOTPLUG);
 }
 
 /**
@@ -962,7 +961,7 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
 
 	system_has_some_mirror = true;
 
-	return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
+	return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_MIRROR);
 }
 
 /**
@@ -982,7 +981,7 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_setclr_flag(base, size, 1, MEMBLOCK_NOMAP);
+	return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_NOMAP);
 }
 
 /**
@@ -994,7 +993,7 @@ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
  */
 int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
+	return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_NOMAP);
 }
 
 static bool should_skip_region(struct memblock_type *type,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag
  2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
  2023-09-06 11:26 ` [v4 1/4] mm: hugetlb_vmemmap: Use nid of the head page to reallocate it Usama Arif
  2023-09-06 11:26 ` [v4 2/4] memblock: pass memblock_type to memblock_setclr_flag Usama Arif
@ 2023-09-06 11:26 ` Usama Arif
  2023-09-06 11:35   ` Muchun Song
  2023-09-06 12:01   ` Mike Rapoport
  2023-09-06 11:26 ` [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
  2023-09-22 14:42 ` [v4 0/4] " Pasha Tatashin
  4 siblings, 2 replies; 17+ messages in thread
From: Usama Arif @ 2023-09-06 11:26 UTC (permalink / raw)
  To: linux-mm, muchun.song, mike.kravetz, rppt
  Cc: linux-kernel, songmuchun, fam.zheng, liangma, punit.agrawal, Usama Arif

For reserved memory regions marked with this flag,
reserve_bootmem_region is not called during memmap_init_reserved_pages.
This can be used to avoid struct page initialization for
regions which won't need them, e.g. hugepages with
HVO (HugeTLB Vmemmap Optimization) enabled.
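
As an illustration, a hypothetical caller (not part of this patch; the real
hugetlb user is added in patch 4/4) would do something like:

	/* Reserve the region as usual... */
	memblock_reserve(base, size);
	/*
	 * ...then mark everything after its first page as noinit, so that
	 * memmap_init_reserved_pages() only initializes the first struct page
	 * and the rest can be initialized (or freed by HVO) later.
	 */
	memblock_reserved_mark_noinit(base + PAGE_SIZE, size - PAGE_SIZE);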

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 include/linux/memblock.h |  9 +++++++++
 mm/memblock.c            | 33 ++++++++++++++++++++++++++++-----
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 1c1072e3ca06..ae3bde302f70 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,6 +40,8 @@ extern unsigned long long max_possible_pfn;
  * via a driver, and never indicated in the firmware-provided memory map as
  * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
  * kernel resource tree.
+ * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
+ * not initialized (only for reserved regions).
  */
 enum memblock_flags {
 	MEMBLOCK_NONE		= 0x0,	/* No special request */
@@ -47,6 +49,7 @@ enum memblock_flags {
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
 	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
+	MEMBLOCK_RSRV_NOINIT	= 0x10,	/* don't initialize struct pages */
 };
 
 /**
@@ -125,6 +128,7 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
 void memblock_free(void *ptr, size_t size);
@@ -259,6 +263,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 	return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_reserved_noinit(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_RSRV_NOINIT;
+}
+
 static inline bool memblock_is_driver_managed(struct memblock_region *m)
 {
 	return m->flags & MEMBLOCK_DRIVER_MANAGED;
diff --git a/mm/memblock.c b/mm/memblock.c
index a49efbaee7e0..8f7a0cb668d4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -996,6 +996,24 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 	return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_NOMAP);
 }
 
+/**
+ * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
+ * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
+ * for this region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * struct pages will not be initialized for reserved memory regions marked with
+ * %MEMBLOCK_RSRV_NOINIT.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(&memblock.reserved, base, size, 1,
+				    MEMBLOCK_RSRV_NOINIT);
+}
+
 static bool should_skip_region(struct memblock_type *type,
 			       struct memblock_region *m,
 			       int nid, int flags)
@@ -2112,13 +2130,18 @@ static void __init memmap_init_reserved_pages(void)
 		memblock_set_node(start, end, &memblock.reserved, nid);
 	}
 
-	/* initialize struct pages for the reserved regions */
+	/*
+	 * initialize struct pages for reserved regions that don't have
+	 * the MEMBLOCK_RSRV_NOINIT flag set
+	 */
 	for_each_reserved_mem_region(region) {
-		nid = memblock_get_region_node(region);
-		start = region->base;
-		end = start + region->size;
+		if (!memblock_is_reserved_noinit(region)) {
+			nid = memblock_get_region_node(region);
+			start = region->base;
+			end = start + region->size;
 
-		reserve_bootmem_region(start, end, nid);
+			reserve_bootmem_region(start, end, nid);
+		}
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
                   ` (2 preceding siblings ...)
  2023-09-06 11:26 ` [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag Usama Arif
@ 2023-09-06 11:26 ` Usama Arif
  2023-09-06 18:10   ` Mike Kravetz
  2023-09-07 18:37   ` Mike Kravetz
  2023-09-22 14:42 ` [v4 0/4] " Pasha Tatashin
  4 siblings, 2 replies; 17+ messages in thread
From: Usama Arif @ 2023-09-06 11:26 UTC (permalink / raw)
  To: linux-mm, muchun.song, mike.kravetz, rppt
  Cc: linux-kernel, songmuchun, fam.zheng, liangma, punit.agrawal, Usama Arif

The new boot flow when it comes to initialization of gigantic pages
is as follows:
- At boot time, for a gigantic page during __alloc_bootmem_hugepage,
the region after the first struct page is marked as noinit.
- This results in only the first struct page being initialized in
reserve_bootmem_region. As the tail struct pages are
not initialized at this point, there can be a significant saving
in boot time if HVO succeeds later on.
- Later on in the boot, the head page is prepped and the first
HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
are initialized.
- HVO is attempted. If it is not successful, then the rest of the
tail struct pages are initialized. If it is successful, no more
tail struct pages need to be initialized, saving significant boot time.
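
Condensed, the per-hugepage part of the new gather_bootmem_prealloc() flow
looks like the sketch below (simplified from the diff in this patch; the
ref-count warning and totalram accounting are omitted):

	list_for_each_entry(m, &huge_boot_pages, list) {
		struct folio *folio = (void *)virt_to_page(m);
		struct hstate *h = m->hstate;

		/* init head + the first HUGETLB_VMEMMAP_RESERVE_PAGES - 1 tails */
		hugetlb_folio_init_vmemmap(folio, h, HUGETLB_VMEMMAP_RESERVE_PAGES);
		/* folio prep; HVO is attempted as part of this */
		prep_new_hugetlb_folio(h, folio, folio_nid(folio));
		/* if HVO did not succeed, init the remaining tail struct pages */
		if (!HPageVmemmapOptimized(&folio->page))
			hugetlb_folio_init_tail_vmemmap(folio,
					HUGETLB_VMEMMAP_RESERVE_PAGES,
					pages_per_huge_page(h));
		free_huge_folio(folio); /* add to the hugepage allocator */
	}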

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
 mm/hugetlb_vmemmap.c |  2 +-
 mm/hugetlb_vmemmap.h |  9 ++++---
 mm/internal.h        |  3 +++
 mm/mm_init.c         |  2 +-
 5 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c32ca241df4b..540e0386514e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	}
 
 found:
+
+	/*
+	 * Only initialize the head struct page in memmap_init_reserved_pages,
+	 * rest of the struct pages will be initialized by the HugeTLB subsystem itself.
+	 * The head struct page is used to get folio information by the HugeTLB
+	 * subsystem like zone id and node id.
+	 */
+	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
+		huge_page_size(h) - PAGE_SIZE);
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
 	list_add(&m->list, &huge_boot_pages);
@@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	return 1;
 }
 
+/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
+static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
+						    unsigned long start_page_number,
+						    unsigned long end_page_number)
+{
+	enum zone_type zone = zone_idx(folio_zone(folio));
+	int nid = folio_nid(folio);
+	unsigned long head_pfn = folio_pfn(folio);
+	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+
+	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__init_single_page(page, pfn, zone, nid);
+		prep_compound_tail((struct page *)folio, pfn - head_pfn);
+		set_page_count(page, 0);
+	}
+}
+
+static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
+					       unsigned long nr_pages)
+{
+	int ret;
+
+	/* Prepare folio head */
+	__folio_clear_reserved(folio);
+	__folio_set_head(folio);
+	ret = page_ref_freeze(&folio->page, 1);
+	VM_BUG_ON(!ret);
+	/* Initialize the necessary tail struct pages */
+	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
+	prep_compound_head((struct page *)folio, huge_page_order(h));
+}
+
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_ORDER) pages.
@@ -3186,19 +3229,19 @@ static void __init gather_bootmem_prealloc(void)
 
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
-		struct folio *folio = page_folio(page);
+		struct folio *folio = (void *)page;
 		struct hstate *h = m->hstate;
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(folio_ref_count(folio) != 1);
-		if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
-			WARN_ON(folio_test_reserved(folio));
-			prep_new_hugetlb_folio(h, folio, folio_nid(folio));
-			free_huge_folio(folio); /* add to the hugepage allocator */
-		} else {
-			/* VERY unlikely inflated ref count on a tail page */
-			free_gigantic_folio(folio, huge_page_order(h));
-		}
+
+		hugetlb_folio_init_vmemmap(folio, h, HUGETLB_VMEMMAP_RESERVE_PAGES);
+		prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+		/* If HVO fails, initialize all tail struct pages */
+		if (!HPageVmemmapOptimized(&folio->page))
+			hugetlb_folio_init_tail_vmemmap(folio, HUGETLB_VMEMMAP_RESERVE_PAGES,
+							pages_per_huge_page(h));
+		free_huge_folio(folio); /* add to the hugepage allocator */
 
 		/*
 		 * We need to restore the 'stolen' pages to totalram_pages
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 3cdb38d87a95..772a877918d7 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -589,7 +589,7 @@ static int __init hugetlb_vmemmap_init(void)
 	const struct hstate *h;
 
 	/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
-	BUILD_BUG_ON(__NR_USED_SUBPAGE * sizeof(struct page) > HUGETLB_VMEMMAP_RESERVE_SIZE);
+	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
 
 	for_each_hstate(h) {
 		if (hugetlb_vmemmap_optimizable(h)) {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 25bd0e002431..4573899855d7 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -10,15 +10,16 @@
 #define _LINUX_HUGETLB_VMEMMAP_H
 #include <linux/hugetlb.h>
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
-void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
-
 /*
  * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
  * Documentation/vm/vmemmap_dedup.rst.
  */
 #define HUGETLB_VMEMMAP_RESERVE_SIZE	PAGE_SIZE
+#define HUGETLB_VMEMMAP_RESERVE_PAGES	(HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
+
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
 {
diff --git a/mm/internal.h b/mm/internal.h
index d1d4bf4e63c0..d74061aa6de7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1154,4 +1154,7 @@ struct vma_prepare {
 	struct vm_area_struct *remove;
 	struct vm_area_struct *remove2;
 };
+
+void __meminit __init_single_page(struct page *page, unsigned long pfn,
+				unsigned long zone, int nid);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..fed4370b02e1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -555,7 +555,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	node_states[N_MEMORY] = saved_node_state;
 }
 
-static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
 	mm_zero_struct_page(page);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag
  2023-09-06 11:26 ` [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag Usama Arif
@ 2023-09-06 11:35   ` Muchun Song
  2023-09-06 12:01   ` Mike Rapoport
  1 sibling, 0 replies; 17+ messages in thread
From: Muchun Song @ 2023-09-06 11:35 UTC (permalink / raw)
  To: Usama Arif
  Cc: Linux-MM, Mike Kravetz, Mike Rapoport (IBM),
	LKML, Muchun Song, fam.zheng, liangma, punit.agrawal



> On Sep 6, 2023, at 19:26, Usama Arif <usama.arif@bytedance.com> wrote:
> 
> For reserved memory regions marked with this flag,
> reserve_bootmem_region is not called during memmap_init_reserved_pages.
> This can be used to avoid struct page initialization for
> regions which won't need them, for e.g. hugepages with
> HVO enabled.
> 
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag
  2023-09-06 11:26 ` [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag Usama Arif
  2023-09-06 11:35   ` Muchun Song
@ 2023-09-06 12:01   ` Mike Rapoport
  1 sibling, 0 replies; 17+ messages in thread
From: Mike Rapoport @ 2023-09-06 12:01 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, muchun.song, mike.kravetz, linux-kernel, songmuchun,
	fam.zheng, liangma, punit.agrawal

On Wed, Sep 06, 2023 at 12:26:04PM +0100, Usama Arif wrote:
> For reserved memory regions marked with this flag,
> reserve_bootmem_region is not called during memmap_init_reserved_pages.
> This can be used to avoid struct page initialization for
> regions which won't need them, for e.g. hugepages with
> HVO enabled.

Nit: please spell out HVO, otherwise
 
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>

Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>

> ---
>  include/linux/memblock.h |  9 +++++++++
>  mm/memblock.c            | 33 ++++++++++++++++++++++++++++-----
>  2 files changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 1c1072e3ca06..ae3bde302f70 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -40,6 +40,8 @@ extern unsigned long long max_possible_pfn;
>   * via a driver, and never indicated in the firmware-provided memory map as
>   * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
>   * kernel resource tree.
> + * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
> + * not initialized (only for reserved regions).
>   */
>  enum memblock_flags {
>  	MEMBLOCK_NONE		= 0x0,	/* No special request */
> @@ -47,6 +49,7 @@ enum memblock_flags {
>  	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
>  	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
>  	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
> +	MEMBLOCK_RSRV_NOINIT	= 0x10,	/* don't initialize struct pages */
>  };
>  
>  /**
> @@ -125,6 +128,7 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
> +int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
>  
>  void memblock_free_all(void);
>  void memblock_free(void *ptr, size_t size);
> @@ -259,6 +263,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>  	return m->flags & MEMBLOCK_NOMAP;
>  }
>  
> +static inline bool memblock_is_reserved_noinit(struct memblock_region *m)
> +{
> +	return m->flags & MEMBLOCK_RSRV_NOINIT;
> +}
> +
>  static inline bool memblock_is_driver_managed(struct memblock_region *m)
>  {
>  	return m->flags & MEMBLOCK_DRIVER_MANAGED;
> diff --git a/mm/memblock.c b/mm/memblock.c
> index a49efbaee7e0..8f7a0cb668d4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -996,6 +996,24 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>  	return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_NOMAP);
>  }
>  
> +/**
> + * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> + * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> + * for this region.
> + * @base: the base phys addr of the region
> + * @size: the size of the region
> + *
> + * struct pages will not be initialized for reserved memory regions marked with
> + * %MEMBLOCK_RSRV_NOINIT.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size)
> +{
> +	return memblock_setclr_flag(&memblock.reserved, base, size, 1,
> +				    MEMBLOCK_RSRV_NOINIT);
> +}
> +
>  static bool should_skip_region(struct memblock_type *type,
>  			       struct memblock_region *m,
>  			       int nid, int flags)
> @@ -2112,13 +2130,18 @@ static void __init memmap_init_reserved_pages(void)
>  		memblock_set_node(start, end, &memblock.reserved, nid);
>  	}
>  
> -	/* initialize struct pages for the reserved regions */
> +	/*
> +	 * initialize struct pages for reserved regions that don't have
> +	 * the MEMBLOCK_RSRV_NOINIT flag set
> +	 */
>  	for_each_reserved_mem_region(region) {
> -		nid = memblock_get_region_node(region);
> -		start = region->base;
> -		end = start + region->size;
> +		if (!memblock_is_reserved_noinit(region)) {
> +			nid = memblock_get_region_node(region);
> +			start = region->base;
> +			end = start + region->size;
>  
> -		reserve_bootmem_region(start, end, nid);
> +			reserve_bootmem_region(start, end, nid);
> +		}
>  	}
>  }
>  
> -- 
> 2.25.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 11:26 ` [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
@ 2023-09-06 18:10   ` Mike Kravetz
  2023-09-06 21:27     ` [External] " Usama Arif
  2023-09-07 18:37   ` Mike Kravetz
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2023-09-06 18:10 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, muchun.song, rppt, linux-kernel, songmuchun, fam.zheng,
	liangma, punit.agrawal

On 09/06/23 12:26, Usama Arif wrote:
> The new boot flow when it comes to initialization of gigantic pages
> is as follows:
> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> the region after the first struct page is marked as noinit.
> - This results in only the first struct page to be
> initialized in reserve_bootmem_region. As the tail struct pages are
> not initialized at this point, there can be a significant saving
> in boot time if HVO succeeds later on.
> - Later on in the boot, the head page is prepped and the first
> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> are initialized.
> - HVO is attempted. If it is not successful, then the rest of the
> tail struct pages are initialized. If it is successful, no more
> tail struct pages need to be initialized saving significant boot time.

Code looks reasonable.  Quick question.

On systems where HVO is disabled, we will still go through this new boot
flow and init hugetlb tail pages later in boot (gather_bootmem_prealloc).
Correct?
If yes, will there be a noticeable change in performance from the current
flow with HVO disabled?  My concern would be allocating a large number of
gigantic pages at boot (TB or more).

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 18:10   ` Mike Kravetz
@ 2023-09-06 21:27     ` Usama Arif
  2023-09-06 21:59       ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2023-09-06 21:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, muchun.song, rppt, linux-kernel, songmuchun, fam.zheng,
	liangma, punit.agrawal



On 06/09/2023 19:10, Mike Kravetz wrote:
> On 09/06/23 12:26, Usama Arif wrote:
>> The new boot flow when it comes to initialization of gigantic pages
>> is as follows:
>> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
>> the region after the first struct page is marked as noinit.
>> - This results in only the first struct page to be
>> initialized in reserve_bootmem_region. As the tail struct pages are
>> not initialized at this point, there can be a significant saving
>> in boot time if HVO succeeds later on.
>> - Later on in the boot, the head page is prepped and the first
>> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
>> are initialized.
>> - HVO is attempted. If it is not successful, then the rest of the
>> tail struct pages are initialized. If it is successful, no more
>> tail struct pages need to be initialized saving significant boot time.
> 
> Code looks reasonable.  Quick question.
> 
> On systems where HVO is disabled, we will still go through this new boot
> flow and init hugetlb tail pages later in boot (gather_bootmem_prealloc).
> Correct?
> If yes, will there be a noticeable change in performance from the current
> flow with HVO disabled?  My concern would be allocating a large number of
> gigantic pages at boot (TB or more).
> 

Thanks for the review.

The patch moves the initialization of struct pages backing a hugepage from 
reserve_bootmem_region to a bit later in the boot, in 
gather_bootmem_prealloc. When HVO is disabled, there will be no 
difference in time taken to boot with or without this patch series, as 
262144 struct pages per gigantic page (for x86) are still going to be 
initialized, just in a different place.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 21:27     ` [External] " Usama Arif
@ 2023-09-06 21:59       ` Mike Kravetz
  2023-09-07 10:14         ` Usama Arif
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2023-09-06 21:59 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, muchun.song, rppt, linux-kernel, songmuchun, fam.zheng,
	liangma, punit.agrawal

On 09/06/23 22:27, Usama Arif wrote:
> 
> 
> On 06/09/2023 19:10, Mike Kravetz wrote:
> > On 09/06/23 12:26, Usama Arif wrote:
> > > The new boot flow when it comes to initialization of gigantic pages
> > > is as follows:
> > > - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> > > the region after the first struct page is marked as noinit.
> > > - This results in only the first struct page to be
> > > initialized in reserve_bootmem_region. As the tail struct pages are
> > > not initialized at this point, there can be a significant saving
> > > in boot time if HVO succeeds later on.
> > > - Later on in the boot, the head page is prepped and the first
> > > HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> > > are initialized.
> > > - HVO is attempted. If it is not successful, then the rest of the
> > > tail struct pages are initialized. If it is successful, no more
> > > tail struct pages need to be initialized saving significant boot time.
> > 
> > Code looks reasonable.  Quick question.
> > 
> > On systems where HVO is disabled, we will still go through this new boot
> > flow and init hugetlb tail pages later in boot (gather_bootmem_prealloc).
> > Correct?
> > If yes, will there be a noticeable change in performance from the current
> > flow with HVO disabled?  My concern would be allocating a large number of
> > gigantic pages at boot (TB or more).
> > 
> 
> Thanks for the review.
> 
> The patch moves the initialization of struct pages backing hugepage from
> reserve_bootmem_region to a bit later on in the boot to
> gather_bootmem_prealloc. When HVO is disabled, there will be no difference
> in time taken to boot with or without this patch series, as 262144 struct
> pages per gigantic page (for x86) are still going to be initialized, just in
> a different place.

I seem to recall that 'normal' deferred struct page initialization was
done in parallel as the result of these series:
https://lore.kernel.org/linux-mm/20171013173214.27300-1-pasha.tatashin@oracle.com/
https://lore.kernel.org/linux-mm/20200527173608.2885243-1-daniel.m.jordan@oracle.com/#t
and perhaps others.

My thought is that we lose that parallel initialization when it is being
done as part of hugetlb fall back initialization.

Does that make sense?  Or am I missing something?  I do not have any proof
that things will be slower.  That is just something I was thinking about.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 21:59       ` Mike Kravetz
@ 2023-09-07 10:14         ` Usama Arif
  2023-09-07 18:24           ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2023-09-07 10:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, muchun.song, rppt, linux-kernel, songmuchun, fam.zheng,
	liangma, punit.agrawal



On 06/09/2023 22:59, Mike Kravetz wrote:
> On 09/06/23 22:27, Usama Arif wrote:
>>
>>
>> On 06/09/2023 19:10, Mike Kravetz wrote:
>>> On 09/06/23 12:26, Usama Arif wrote:
>>>> The new boot flow when it comes to initialization of gigantic pages
>>>> is as follows:
>>>> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
>>>> the region after the first struct page is marked as noinit.
>>>> - This results in only the first struct page to be
>>>> initialized in reserve_bootmem_region. As the tail struct pages are
>>>> not initialized at this point, there can be a significant saving
>>>> in boot time if HVO succeeds later on.
>>>> - Later on in the boot, the head page is prepped and the first
>>>> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
>>>> are initialized.
>>>> - HVO is attempted. If it is not successful, then the rest of the
>>>> tail struct pages are initialized. If it is successful, no more
>>>> tail struct pages need to be initialized saving significant boot time.
>>>
>>> Code looks reasonable.  Quick question.
>>>
>>> On systems where HVO is disabled, we will still go through this new boot
>>> flow and init hugetlb tail pages later in boot (gather_bootmem_prealloc).
>>> Correct?
>>> If yes, will there be a noticeable change in performance from the current
>>> flow with HVO disabled?  My concern would be allocating a large number of
>>> gigantic pages at boot (TB or more).
>>>
>>
>> Thanks for the review.
>>
>> The patch moves the initialization of struct pages backing hugepage from
>> reserve_bootmem_region to a bit later on in the boot to
>> gather_bootmem_prealloc. When HVO is disabled, there will be no difference
>> in time taken to boot with or without this patch series, as 262144 struct
>> pages per gigantic page (for x86) are still going to be initialized, just in
>> a different place.
> 
> I seem to recall that 'normal' deferred struct page initialization was
> done in parallel as the result of these series:
> https://lore.kernel.org/linux-mm/20171013173214.27300-1-pasha.tatashin@oracle.com/
> https://lore.kernel.org/linux-mm/20200527173608.2885243-1-daniel.m.jordan@oracle.com/#t
> and perhaps others.
> 
> My thought is that we lose that parallel initialization when it is being
> done as part of hugetlb fall back initialization.
> 
> Does that make sense?  Or am I missing something?  I do not have any proof
> that things will be slower.  That is just something I was thinking about.

The patches for deferring struct page initialization did not cover the 
struct pages for gigantic pages.

With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, the function call taken 
during boot without these patches is:

[A1] mm_core_init-> mem_init-> memblock_free_all-> 
free_low_memory_core_early-> memmap_init_reserved_pages-> 
reserve_bootmem_region-> initialize *all* struct pages of a gigantic 
page serially (DEFERRED_STRUCT_PAGE_INIT is enabled).
The pfn of the struct pages > NODE_DATA(nid)->first_deferred_pfn which 
means this cannot be deferred.

then later on in the boot:

[A2] hugetlb_init-> gather_bootmem_prealloc-> 
prep_compound_gigantic_folio-> prepare *all* the struct pages to be part 
of a gigantic page (freezing page ref count, setting compound head, etc 
for all struct pages)

With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, the function call taken 
during boot with these patches is:

[B1] mm_core_init->...reserve_bootmem_region-> initialize head struct 
page only.

then later on in the boot:

[B2] hugetlb_init-> gather_bootmem_prealloc-> [B21] initialize only 64 
tail struct pages if HVO passes. [B22] If HVO fails initialize all tail 
struct pages.


Each of A1, A2 and B22 is a for loop going over 262144 struct pages per 
hugepage. So without these patches, the work done is 262144*2 (A1+A2) 
per hugepage during boot, even with CONFIG_DEFERRED_STRUCT_PAGE_INIT, as 
it's not deferred. With these patches, the work done is either 1 + 64 
(B1+B21) if HVO is enabled or 1 + 262144 (B1+B22) if HVO is disabled.

With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, the times taken to boot 
until the init process when allocating 500 1G hugepages are:
- with these patches, HVO enabled: 1.32 seconds [B1 + B21]
- with patches, HVO disabled: 2.15 seconds [B1 + B22]
- without patches, HVO enabled: 3.90  seconds [A1 + A2 + HVO]
- without patches, HVO disabled: 3.58 seconds [A1 + A2]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-07 10:14         ` Usama Arif
@ 2023-09-07 18:24           ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2023-09-07 18:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, muchun.song, rppt, linux-kernel, songmuchun, fam.zheng,
	liangma, punit.agrawal

On 09/07/23 11:14, Usama Arif wrote:
> 
> 
> On 06/09/2023 22:59, Mike Kravetz wrote:
> > On 09/06/23 22:27, Usama Arif wrote:
> > > 
> > > 
> > > On 06/09/2023 19:10, Mike Kravetz wrote:
> > > > On 09/06/23 12:26, Usama Arif wrote:
> > > > > The new boot flow when it comes to initialization of gigantic pages
> > > > > is as follows:
> > > > > - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> > > > > the region after the first struct page is marked as noinit.
> > > > > - This results in only the first struct page to be
> > > > > initialized in reserve_bootmem_region. As the tail struct pages are
> > > > > not initialized at this point, there can be a significant saving
> > > > > in boot time if HVO succeeds later on.
> > > > > - Later on in the boot, the head page is prepped and the first
> > > > > HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> > > > > are initialized.
> > > > > - HVO is attempted. If it is not successful, then the rest of the
> > > > > tail struct pages are initialized. If it is successful, no more
> > > > > tail struct pages need to be initialized saving significant boot time.
> > > > 
> > > > Code looks reasonable.  Quick question.
> > > > 
> > > > On systems where HVO is disabled, we will still go through this new boot
> > > > flow and init hugetlb tail pages later in boot (gather_bootmem_prealloc).
> > > > Correct?
> > > > If yes, will there be a noticeable change in performance from the current
> > > > flow with HVO disabled?  My concern would be allocating a large number of
> > > > gigantic pages at boot (TB or more).
> > > > 
> > > 
> > > Thanks for the review.
> > > 
> > > The patch moves the initialization of struct pages backing hugepage from
> > > reserve_bootmem_region to a bit later on in the boot to
> > > gather_bootmem_prealloc. When HVO is disabled, there will be no difference
> > > in time taken to boot with or without this patch series, as 262144 struct
> > > pages per gigantic page (for x86) are still going to be initialized, just in
> > > a different place.
> > 
> > I seem to recall that 'normal' deferred struct page initialization was
> > done in parallel as the result of these series:
> > https://lore.kernel.org/linux-mm/20171013173214.27300-1-pasha.tatashin@oracle.com/
> > https://lore.kernel.org/linux-mm/20200527173608.2885243-1-daniel.m.jordan@oracle.com/#t
> > and perhaps others.
> > 
> > My thought is that we lose that parallel initialization when it is being
> > done as part of hugetlb fall back initialization.
> > 
> > Does that make sense?  Or am I missing something?  I do not have any proof
> > that things will be slower.  That is just something I was thinking about.
> 
> The patches for deferring struct page initialization did not cover the
> struct pages for gigantic pages.
> 
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, the function call taken
> during boot without these patches is:
> 
> [A1] mm_core_init-> mem_init-> memblock_free_all->
> free_low_memory_core_early-> memmap_init_reserved_pages->
> reserve_bootmem_region-> initialize *all* struct pages of a gigantic page
> serially (DEFERRED_STRUCT_PAGE_INIT is enabled).
> The pfn of the struct pages > NODE_DATA(nid)->first_deferred_pfn which means
> this cannot be deferred.

Thank you very much!
I am not very familiar with the init process and just wanted to make sure that
no possible performance regression was introduced.

I will make some specific comments on the patch, but as previously stated it
looks pretty good.

-- 
Mike Kravetz

> then later on in the boot:
> 
> [A2] hugetlb_init-> gather_bootmem_prealloc-> prep_compound_gigantic_folio->
> prepare *all* the struct pages to be part of a gigantic page (freezing page
> ref count, setting compound head, etc for all struct pages)
> 
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, the function call taken
> during boot with these patches is:
> 
> [B1] mm_core_init->...reserve_bootmem_region-> initialize head struct page
> only.
> 
> then later on in the boot:
> 
> [B2] hugetlb_init-> gather_bootmem_prealloc-> [B21] initialize only 64 tail
> struct pages if HVO passes. [B22] If HVO fails initialize all tail struct
> pages.
> 
> 
> Each of A1, A2 and B22 are for loops going over 262144 struct pages per
> hugepage. So without these patches, the work done is 262144*2 (A1+A2) per
> hugepage during boot, even with CONFIG_DEFERRED_STRUCT_PAGE_INIT as its not
> deferred. With these patches, the work done is either 1 + 64 (B1+B21) if HVO
> is enabled or 1 + 262144 (B1+B22) if HVO is disabled.
> 
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, the times taken to boot till
> init process when allocating 500 1G hugeppages are:
> - with these patches, HVO enabled: 1.32 seconds [B1 + B21]
> - with patches, HVO disabled: 2.15 seconds [B1 + B22]
> - without patches, HVO enabled: 3.90  seconds [A1 + A2 + HVO]
> - without patches, HVO disabled: 3.58 seconds [A1 + A2]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 11:26 ` [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
  2023-09-06 18:10   ` Mike Kravetz
@ 2023-09-07 18:37   ` Mike Kravetz
  2023-09-08  2:39     ` Muchun Song
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2023-09-07 18:37 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, muchun.song, rppt, linux-kernel, songmuchun, fam.zheng,
	liangma, punit.agrawal

On 09/06/23 12:26, Usama Arif wrote:
> The new boot flow when it comes to initialization of gigantic pages
> is as follows:
> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> the region after the first struct page is marked as noinit.
> - This results in only the first struct page to be
> initialized in reserve_bootmem_region. As the tail struct pages are
> not initialized at this point, there can be a significant saving
> in boot time if HVO succeeds later on.
> - Later on in the boot, the head page is prepped and the first
> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> are initialized.
> - HVO is attempted. If it is not successful, then the rest of the
> tail struct pages are initialized. If it is successful, no more
> tail struct pages need to be initialized saving significant boot time.
> 
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> ---
>  mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
>  mm/hugetlb_vmemmap.c |  2 +-
>  mm/hugetlb_vmemmap.h |  9 ++++---
>  mm/internal.h        |  3 +++
>  mm/mm_init.c         |  2 +-
>  5 files changed, 62 insertions(+), 15 deletions(-)

As mentioned, in general this looks good.  One small point below.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c32ca241df4b..540e0386514e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  	}
>  
>  found:
> +
> +	/*
> +	 * Only initialize the head struct page in memmap_init_reserved_pages,
> +	 * rest of the struct pages will be initialized by the HugeTLB subsystem itself.
> +	 * The head struct page is used to get folio information by the HugeTLB
> +	 * subsystem like zone id and node id.
> +	 */
> +	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
> +		huge_page_size(h) - PAGE_SIZE);
>  	/* Put them into a private list first because mem_map is not up yet */
>  	INIT_LIST_HEAD(&m->list);
>  	list_add(&m->list, &huge_boot_pages);
> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  	return 1;
>  }
>  
> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> +						    unsigned long start_page_number,
> +						    unsigned long end_page_number)
> +{
> +	enum zone_type zone = zone_idx(folio_zone(folio));
> +	int nid = folio_nid(folio);
> +	unsigned long head_pfn = folio_pfn(folio);
> +	unsigned long pfn, end_pfn = head_pfn + end_page_number;
> +
> +	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
> +		struct page *page = pfn_to_page(pfn);
> +
> +		__init_single_page(page, pfn, zone, nid);
> +		prep_compound_tail((struct page *)folio, pfn - head_pfn);
> +		set_page_count(page, 0);
> +	}
> +}
> +
> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
> +					       unsigned long nr_pages)
> +{
> +	int ret;
> +
> +	/* Prepare folio head */
> +	__folio_clear_reserved(folio);
> +	__folio_set_head(folio);
> +	ret = page_ref_freeze(&folio->page, 1);
> +	VM_BUG_ON(!ret);

In the current code, we print a warning and free the associated pages to
buddy if we ever experience an increased ref count.  The routine
hugetlb_folio_init_tail_vmemmap does not check for this.

I do not believe speculative/temporary ref counts this early in the boot
process are possible.  It would be great to get input from someone else.

When I wrote the existing code, it was fairly easy to WARN and continue
if we encountered an increased ref count.  Things would be bit more
complicated here.  So, it may not be worth the effort.
-- 
Mike Kravetz

> +	/* Initialize the necessary tail struct pages */
> +	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
> +	prep_compound_head((struct page *)folio, huge_page_order(h));
> +}
> +
>  /*
>   * Put bootmem huge pages into the standard lists after mem_map is up.
>   * Note: This only applies to gigantic (order > MAX_ORDER) pages.
> @@ -3186,19 +3229,19 @@ static void __init gather_bootmem_prealloc(void)
>  
>  	list_for_each_entry(m, &huge_boot_pages, list) {
>  		struct page *page = virt_to_page(m);
> -		struct folio *folio = page_folio(page);
> +		struct folio *folio = (void *)page;
>  		struct hstate *h = m->hstate;
>  
>  		VM_BUG_ON(!hstate_is_gigantic(h));
>  		WARN_ON(folio_ref_count(folio) != 1);
> -		if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
> -			WARN_ON(folio_test_reserved(folio));
> -			prep_new_hugetlb_folio(h, folio, folio_nid(folio));
> -			free_huge_folio(folio); /* add to the hugepage allocator */
> -		} else {
> -			/* VERY unlikely inflated ref count on a tail page */
> -			free_gigantic_folio(folio, huge_page_order(h));
> -		}
> +
> +		hugetlb_folio_init_vmemmap(folio, h, HUGETLB_VMEMMAP_RESERVE_PAGES);
> +		prep_new_hugetlb_folio(h, folio, folio_nid(folio));
> +		/* If HVO fails, initialize all tail struct pages */
> +		if (!HPageVmemmapOptimized(&folio->page))
> +			hugetlb_folio_init_tail_vmemmap(folio, HUGETLB_VMEMMAP_RESERVE_PAGES,
> +							pages_per_huge_page(h));
> +		free_huge_folio(folio); /* add to the hugepage allocator */
>  
>  		/*
>  		 * We need to restore the 'stolen' pages to totalram_pages
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 3cdb38d87a95..772a877918d7 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -589,7 +589,7 @@ static int __init hugetlb_vmemmap_init(void)
>  	const struct hstate *h;
>  
>  	/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
> -	BUILD_BUG_ON(__NR_USED_SUBPAGE * sizeof(struct page) > HUGETLB_VMEMMAP_RESERVE_SIZE);
> +	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
>  
>  	for_each_hstate(h) {
>  		if (hugetlb_vmemmap_optimizable(h)) {
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index 25bd0e002431..4573899855d7 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -10,15 +10,16 @@
>  #define _LINUX_HUGETLB_VMEMMAP_H
>  #include <linux/hugetlb.h>
>  
> -#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> -int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> -void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
> -
>  /*
>   * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
>   * Documentation/vm/vmemmap_dedup.rst.
>   */
>  #define HUGETLB_VMEMMAP_RESERVE_SIZE	PAGE_SIZE
> +#define HUGETLB_VMEMMAP_RESERVE_PAGES	(HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
> +
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
>  
>  static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
>  {
> diff --git a/mm/internal.h b/mm/internal.h
> index d1d4bf4e63c0..d74061aa6de7 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1154,4 +1154,7 @@ struct vma_prepare {
>  	struct vm_area_struct *remove;
>  	struct vm_area_struct *remove2;
>  };
> +
> +void __meminit __init_single_page(struct page *page, unsigned long pfn,
> +				unsigned long zone, int nid);
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 50f2f34745af..fed4370b02e1 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -555,7 +555,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>  	node_states[N_MEMORY] = saved_node_state;
>  }
>  
> -static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> +void __meminit __init_single_page(struct page *page, unsigned long pfn,
>  				unsigned long zone, int nid)
>  {
>  	mm_zero_struct_page(page);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-07 18:37   ` Mike Kravetz
@ 2023-09-08  2:39     ` Muchun Song
  2023-09-08 18:29       ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Muchun Song @ 2023-09-08  2:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Usama Arif, Linux-MM, Mike Rapoport (IBM),
	LKML, Muchun Song, fam.zheng, liangma, punit.agrawal



> On Sep 8, 2023, at 02:37, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> On 09/06/23 12:26, Usama Arif wrote:
>> The new boot flow when it comes to initialization of gigantic pages
>> is as follows:
>> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
>> the region after the first struct page is marked as noinit.
>> - This results in only the first struct page to be
>> initialized in reserve_bootmem_region. As the tail struct pages are
>> not initialized at this point, there can be a significant saving
>> in boot time if HVO succeeds later on.
>> - Later on in the boot, the head page is prepped and the first
>> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
>> are initialized.
>> - HVO is attempted. If it is not successful, then the rest of the
>> tail struct pages are initialized. If it is successful, no more
>> tail struct pages need to be initialized saving significant boot time.
>> 
>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>> ---
>> mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
>> mm/hugetlb_vmemmap.c |  2 +-
>> mm/hugetlb_vmemmap.h |  9 ++++---
>> mm/internal.h        |  3 +++
>> mm/mm_init.c         |  2 +-
>> 5 files changed, 62 insertions(+), 15 deletions(-)
> 
> As mentioned, in general this looks good.  One small point below.
> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index c32ca241df4b..540e0386514e 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>> }
>> 
>> found:
>> +
>> + 	/*
>> + 	 * Only initialize the head struct page in memmap_init_reserved_pages,
>> + 	 * rest of the struct pages will be initialized by the HugeTLB subsystem itself.
>> + 	 * The head struct page is used to get folio information by the HugeTLB
>> + 	 * subsystem like zone id and node id.
>> + 	 */
>> + 	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
>> + 	huge_page_size(h) - PAGE_SIZE);
>> 	/* Put them into a private list first because mem_map is not up yet */
>> 	INIT_LIST_HEAD(&m->list);
>> 	list_add(&m->list, &huge_boot_pages);
>> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>> 	return 1;
>> }
>> 
>> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
>> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
>> +     		unsigned long start_page_number,
>> +     		unsigned long end_page_number)
>> +{
>> + 	enum zone_type zone = zone_idx(folio_zone(folio));
>> + 	int nid = folio_nid(folio);
>> + 	unsigned long head_pfn = folio_pfn(folio);
>> + 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
>> +
>> + 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
>> + 	struct page *page = pfn_to_page(pfn);
>> +
>> + 		__init_single_page(page, pfn, zone, nid);
>> + 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
>> + 		set_page_count(page, 0);
>> + 	}
>> +}
>> +
>> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
>> +        unsigned long nr_pages)
>> +{
>> + 	int ret;
>> +
>> + 	/* Prepare folio head */
>> +	 __folio_clear_reserved(folio);
>> + 	__folio_set_head(folio);
>> + 	ret = page_ref_freeze(&folio->page, 1);
>> + 	VM_BUG_ON(!ret);
> 
> In the current code, we print a warning and free the associated pages to
> buddy if we ever experience an increased ref count.  The routine
> hugetlb_folio_init_tail_vmemmap does not check for this.
> 
> I do not believe speculative/temporary ref counts this early in the boot
> process are possible.  It would be great to get input from someone else.

Yes, it is a very early stage and the other tail struct pages haven't been
initialized yet, so no one should reference them. It is the same case
as when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

> 
> When I wrote the existing code, it was fairly easy to WARN and continue
> if we encountered an increased ref count.  Things would be bit more

In your case, I think it is not in the boot process, right?

> complicated here.  So, it may not be worth the effort.

Agree. Note that tail struct pages are not initialized here; if we want to
handle the head page, how do we handle the tail pages? It really cannot be
resolved. We should make the same assumption as with
CONFIG_DEFERRED_STRUCT_PAGE_INIT: that no one should reference those pages.

Thanks.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-08  2:39     ` Muchun Song
@ 2023-09-08 18:29       ` Mike Kravetz
  2023-09-08 20:48         ` [External] " Usama Arif
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2023-09-08 18:29 UTC (permalink / raw)
  To: Muchun Song
  Cc: Usama Arif, Linux-MM, Mike Rapoport (IBM),
	LKML, Muchun Song, fam.zheng, liangma, punit.agrawal

On 09/08/23 10:39, Muchun Song wrote:
> 
> 
> > On Sep 8, 2023, at 02:37, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> > On 09/06/23 12:26, Usama Arif wrote:
> >> The new boot flow when it comes to initialization of gigantic pages
> >> is as follows:
> >> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> >> the region after the first struct page is marked as noinit.
> >> - This results in only the first struct page to be
> >> initialized in reserve_bootmem_region. As the tail struct pages are
> >> not initialized at this point, there can be a significant saving
> >> in boot time if HVO succeeds later on.
> >> - Later on in the boot, the head page is prepped and the first
> >> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> >> are initialized.
> >> - HVO is attempted. If it is not successful, then the rest of the
> >> tail struct pages are initialized. If it is successful, no more
> >> tail struct pages need to be initialized saving significant boot time.
> >> 
> >> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> >> ---
> >> mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
> >> mm/hugetlb_vmemmap.c |  2 +-
> >> mm/hugetlb_vmemmap.h |  9 ++++---
> >> mm/internal.h        |  3 +++
> >> mm/mm_init.c         |  2 +-
> >> 5 files changed, 62 insertions(+), 15 deletions(-)
> > 
> > As mentioned, in general this looks good.  One small point below.
> > 
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index c32ca241df4b..540e0386514e 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >> }
> >> 
> >> found:
> >> +
> >> + 	/*
> >> + 	 * Only initialize the head struct page in memmap_init_reserved_pages,
> >> + 	 * rest of the struct pages will be initialized by the HugeTLB subsystem itself.
> >> + 	 * The head struct page is used to get folio information by the HugeTLB
> >> + 	 * subsystem like zone id and node id.
> >> + 	 */
> >> + 	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
> >> + 	huge_page_size(h) - PAGE_SIZE);
> >> 	/* Put them into a private list first because mem_map is not up yet */
> >> 	INIT_LIST_HEAD(&m->list);
> >> 	list_add(&m->list, &huge_boot_pages);
> >> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >> 	return 1;
> >> }
> >> 
> >> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
> >> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> >> +     		unsigned long start_page_number,
> >> +     		unsigned long end_page_number)
> >> +{
> >> + 	enum zone_type zone = zone_idx(folio_zone(folio));
> >> + 	int nid = folio_nid(folio);
> >> + 	unsigned long head_pfn = folio_pfn(folio);
> >> + 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
> >> +
> >> + 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
> >> + 		struct page *page = pfn_to_page(pfn);
> >> +
> >> + 		__init_single_page(page, pfn, zone, nid);
> >> + 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
> >> + 		set_page_count(page, 0);
> >> + 	}
> >> +}
> >> +
> >> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
> >> +        unsigned long nr_pages)
> >> +{
> >> + 	int ret;
> >> +
> >> + 	/* Prepare folio head */
> >> +	 __folio_clear_reserved(folio);
> >> + 	__folio_set_head(folio);
> >> + 	ret = page_ref_freeze(&folio->page, 1);
> >> + 	VM_BUG_ON(!ret);
> > 
> > In the current code, we print a warning and free the associated pages to
> > buddy if we ever experience an increased ref count.  The routine
> > hugetlb_folio_init_tail_vmemmap does not check for this.
> > 
> > I do not believe speculative/temporary ref counts this early in the boot
> > process are possible.  It would be great to get input from someone else.
> 
> Yes, it is a very early stage and the other tail struct pages haven't been
> initialized yet, so no one should reference them. It is the same case
> as when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
> 
> > 
> > When I wrote the existing code, it was fairly easy to WARN and continue
> > if we encountered an increased ref count.  Things would be a bit more
> 
> In your case, I think it is not in the boot process, right?

They were calls in the same routine: gather_bootmem_prealloc().

> > complicated here.  So, it may not be worth the effort.
> 
> Agree. Note that the tail struct pages are not initialized here; even if we
> wanted to handle an elevated ref count on the head page, how would we handle
> the tail pages? It really cannot be resolved. We should make the same
> assumption as CONFIG_DEFERRED_STRUCT_PAGE_INIT: no one should reference those pages.

Agree that speculative refs should not happen this early.  How about making
the following changes?
- Instead of set_page_count() in hugetlb_folio_init_tail_vmemmap, do a
  page_ref_freeze and VM_BUG_ON if the freeze fails (i.e. ref_count != 1).
- In the commit message, mention 'The WARN_ON for increased ref count in
  gather_bootmem_prealloc was changed to a VM_BUG_ON.  This is OK as
   there should be no speculative references this early in the boot process.
  The VM_BUG_ON's are there just in case such code is introduced.'
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-08 18:29       ` Mike Kravetz
@ 2023-09-08 20:48         ` Usama Arif
  0 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2023-09-08 20:48 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song
  Cc: Linux-MM, Mike Rapoport (IBM),
	LKML, Muchun Song, fam.zheng, liangma, punit.agrawal



On 08/09/2023 19:29, Mike Kravetz wrote:
> On 09/08/23 10:39, Muchun Song wrote:
>>
>>
>>> On Sep 8, 2023, at 02:37, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> On 09/06/23 12:26, Usama Arif wrote:
>>>> The new boot flow when it comes to initialization of gigantic pages
>>>> is as follows:
>>>> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
>>>> the region after the first struct page is marked as noinit.
>>>> - This results in only the first struct page to be
>>>> initialized in reserve_bootmem_region. As the tail struct pages are
>>>> not initialized at this point, there can be a significant saving
>>>> in boot time if HVO succeeds later on.
>>>> - Later on in the boot, the head page is prepped and the first
>>>> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
>>>> are initialized.
>>>> - HVO is attempted. If it is not successful, then the rest of the
>>>> tail struct pages are initialized. If it is successful, no more
>>>> tail struct pages need to be initialized saving significant boot time.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>>>> ---
>>>> mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
>>>> mm/hugetlb_vmemmap.c |  2 +-
>>>> mm/hugetlb_vmemmap.h |  9 ++++---
>>>> mm/internal.h        |  3 +++
>>>> mm/mm_init.c         |  2 +-
>>>> 5 files changed, 62 insertions(+), 15 deletions(-)
>>>
>>> As mentioned, in general this looks good.  One small point below.
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index c32ca241df4b..540e0386514e 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>>> }
>>>>
>>>> found:
>>>> +
>>>> + 	/*
>>>> + 	 * Only initialize the head struct page in memmap_init_reserved_pages,
>>>> + 	 * rest of the struct pages will be initialized by the HugeTLB subsystem itself.
>>>> + 	 * The head struct page is used to get folio information by the HugeTLB
>>>> + 	 * subsystem like zone id and node id.
>>>> + 	 */
>>>> + 	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
>>>> + 	huge_page_size(h) - PAGE_SIZE);
>>>> 	/* Put them into a private list first because mem_map is not up yet */
>>>> 	INIT_LIST_HEAD(&m->list);
>>>> 	list_add(&m->list, &huge_boot_pages);
>>>> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>>> 	return 1;
>>>> }
>>>>
>>>> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
>>>> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
>>>> +     		unsigned long start_page_number,
>>>> +     		unsigned long end_page_number)
>>>> +{
>>>> + 	enum zone_type zone = zone_idx(folio_zone(folio));
>>>> + 	int nid = folio_nid(folio);
>>>> + 	unsigned long head_pfn = folio_pfn(folio);
>>>> + 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
>>>> +
>>>> + 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
>>>> + 	struct page *page = pfn_to_page(pfn);
>>>> +
>>>> + 		__init_single_page(page, pfn, zone, nid);
>>>> + 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
>>>> + 		set_page_count(page, 0);
>>>> + 	}
>>>> +}
>>>> +
>>>> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
>>>> +        unsigned long nr_pages)
>>>> +{
>>>> + 	int ret;
>>>> +
>>>> + 	/* Prepare folio head */
>>>> +	 __folio_clear_reserved(folio);
>>>> + 	__folio_set_head(folio);
>>>> + 	ret = page_ref_freeze(&folio->page, 1);
>>>> + 	VM_BUG_ON(!ret);
>>>
>>> In the current code, we print a warning and free the associated pages to
>>> buddy if we ever experience an increased ref count.  The routine
>>> hugetlb_folio_init_tail_vmemmap does not check for this.
>>>
>>> I do not believe speculative/temporary ref counts this early in the boot
>>> process are possible.  It would be great to get input from someone else.
>>
>> Yes, it is a very early stage and the other tail struct pages haven't been
>> initialized yet, so no one should reference them. It is the same case
>> as when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
>>
>>>
>>> When I wrote the existing code, it was fairly easy to WARN and continue
>>> if we encountered an increased ref count.  Things would be a bit more
>>
>> In your case, I think it is not in the boot process, right?
> 
> They were calls in the same routine: gather_bootmem_prealloc().
> 
>>> complicated here.  So, it may not be worth the effort.
>>
>> Agree. Note that the tail struct pages are not initialized here; even if we
>> wanted to handle an elevated ref count on the head page, how would we handle
>> the tail pages? It really cannot be resolved. We should make the same
>> assumption as CONFIG_DEFERRED_STRUCT_PAGE_INIT: no one should reference those pages.
> 
> Agree that speculative refs should not happen this early.  How about making
> the following changes?
> - Instead of set_page_count() in hugetlb_folio_init_tail_vmemmap, do a
>    page_ref_freeze and VM_BUG_ON if the freeze fails (i.e. ref_count != 1).
> - In the commit message, mention 'The WARN_ON for increased ref count in
>    gather_bootmem_prealloc was changed to a VM_BUG_ON.  This is OK as
>    there should be no speculative references this early in the boot process.
>    The VM_BUG_ON's are there just in case such code is introduced.'

Sounds good, although it's not possible for the refcount to be anything other
than 1, as nothing happens between __init_single_page() and setting/freezing
the refcount to 0. I will include the diff below in the next revision, with
the explanation in the commit message as suggested.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 540e0386514e..ed37c6e4e952 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3194,13 +3194,15 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
         int nid = folio_nid(folio);
         unsigned long head_pfn = folio_pfn(folio);
         unsigned long pfn, end_pfn = head_pfn + end_page_number;
+       int ret;

         for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
                 struct page *page = pfn_to_page(pfn);

                 __init_single_page(page, pfn, zone, nid);
                 prep_compound_tail((struct page *)folio, pfn - head_pfn);
-               set_page_count(page, 0);
+               ret = page_ref_freeze(page, 1);
+               VM_BUG_ON(!ret);
         }
  }
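
For clarity, this is roughly how the tail-initialization loop reads after the
change, as a sketch assembled from the diff above (the ret declaration added
at the top of the function is omitted here):

	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		__init_single_page(page, pfn, zone, nid);
		prep_compound_tail((struct page *)folio, pfn - head_pfn);
		/*
		 * Freeze the ref count (1 -> 0). Nothing should hold a
		 * reference this early in boot, so the freeze is expected
		 * to always succeed; the VM_BUG_ON only catches code that
		 * might take such references in the future.
		 */
		ret = page_ref_freeze(page, 1);
		VM_BUG_ON(!ret);
	}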

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
  2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
                   ` (3 preceding siblings ...)
  2023-09-06 11:26 ` [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
@ 2023-09-22 14:42 ` Pasha Tatashin
  4 siblings, 0 replies; 17+ messages in thread
From: Pasha Tatashin @ 2023-09-22 14:42 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, muchun.song, mike.kravetz, rppt, linux-kernel,
	songmuchun, fam.zheng, liangma, punit.agrawal

On Wed, Sep 6, 2023 at 7:26 AM Usama Arif <usama.arif@bytedance.com> wrote:
>
> This series moves the boot time initialization of tail struct pages of a
> gigantic page to later on in the boot. Only the
> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> are initialized at the start. If HVO is successful, then no more tail struct
> pages need to be initialized. For a 1G hugepage, this series avoid
> initialization of 262144 - 63 = 262081 struct pages per hugepage.
>
> When tested on a 512G system (which can allocate max 500 1G hugepages), the
> kexec-boot time with HVO and DEFERRED_STRUCT_PAGE_INIT enabled without this
> patchseries to running init is 3.9 seconds. With this patch it is 1.2 seconds.
> This represents an approximately 70% reduction in boot time and will
> significantly reduce server downtime when using a large number of
> gigantic pages.

My use case is different, but this patch series benefits it. I have a
virtual machine with a large number of hugetlb pages. The RSS size of
the VM after boot is much smaller with this series:

Before: 9G
After: 600M

The VM has 500 1G pages and 512G total RAM. I would add to the
description that this series can help reduce VM memory overhead and improve
boot performance for those who are using hugetlb pages in VMs.

Also, DEFERRED_STRUCT_PAGE_INIT is a requirement for this series to
work, and should be mentioned in the documentation.
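
As a rough illustration (the option names and page count below are examples
for a typical x86_64 tree, not part of this series), such a setup enables HVO
and deferred struct page init and reserves the gigantic pages on the kernel
command line:

	# Kconfig
	CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
	CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

	# kernel command line
	hugetlb_free_vmemmap=on default_hugepagesz=1G hugepagesz=1G hugepages=500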

Pasha

> Thanks,
> Usama
>
> [v3->v4]:
> - rebase ontop of patch "hugetlb: set hugetlb page flag before optimizing vmemmap".
> - freeze head struct page ref count.
> - Change order of operations to initialize head struct page -> initialize
> the necessary tail struct pages -> attempt HVO -> initialize the rest of the
> tail struct pages if HVO fails.
> - (Mike Rapoport and Muchun Song) remove "_vmemmap" suffix from memblock reserve
> noinit flags anf functions.
>
> [v2->v3]:
> - (Muchun Song) skip prep of struct pages backing gigantic hugepages
> at boot time only.
> - (Muchun Song) move initialization of tail struct pages to after
> HVO is attempted.
>
> [v1->v2]:
> - (Mike Rapoport) Code quality improvements (function names, arguments,
> comments).
>
> [RFC->v1]:
> - (Mike Rapoport) Change from passing hugepage_size in
> memblock_alloc_try_nid_raw for skipping struct page initialization to
> using MEMBLOCK_RSRV_NOINIT flag
>
> Usama Arif (4):
>   mm: hugetlb_vmemmap: Use nid of the head page to reallocate it
>   memblock: pass memblock_type to memblock_setclr_flag
>   memblock: introduce MEMBLOCK_RSRV_NOINIT flag
>   mm: hugetlb: Skip initialization of gigantic tail struct pages if
>     freed by HVO
>
>  include/linux/memblock.h |  9 ++++++
>  mm/hugetlb.c             | 61 ++++++++++++++++++++++++++++++++++------
>  mm/hugetlb_vmemmap.c     |  4 +--
>  mm/hugetlb_vmemmap.h     |  9 +++---
>  mm/internal.h            |  3 ++
>  mm/memblock.c            | 48 ++++++++++++++++++++++---------
>  mm/mm_init.c             |  2 +-
>  7 files changed, 107 insertions(+), 29 deletions(-)
>
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-09-22 14:43 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
2023-09-06 11:26 ` [v4 1/4] mm: hugetlb_vmemmap: Use nid of the head page to reallocate it Usama Arif
2023-09-06 11:26 ` [v4 2/4] memblock: pass memblock_type to memblock_setclr_flag Usama Arif
2023-09-06 11:26 ` [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag Usama Arif
2023-09-06 11:35   ` Muchun Song
2023-09-06 12:01   ` Mike Rapoport
2023-09-06 11:26 ` [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
2023-09-06 18:10   ` Mike Kravetz
2023-09-06 21:27     ` [External] " Usama Arif
2023-09-06 21:59       ` Mike Kravetz
2023-09-07 10:14         ` Usama Arif
2023-09-07 18:24           ` Mike Kravetz
2023-09-07 18:37   ` Mike Kravetz
2023-09-08  2:39     ` Muchun Song
2023-09-08 18:29       ` Mike Kravetz
2023-09-08 20:48         ` [External] " Usama Arif
2023-09-22 14:42 ` [v4 0/4] " Pasha Tatashin
