* [PATCH v6 0/8] Batch hugetlb vmemmap modification operations
@ 2023-09-25 23:48 Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 1/8] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Mike Kravetz
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1].  The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower.  Such overhead was deemed
an acceptable trade off for the memory savings obtained by freeing vmemmap
pages.

It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x!  Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    2m29.924s

In the VM environment, enabling hugetlb vmemmap optimization made
allocation 61x slower.

A quick profile showed that the vast majority of this overhead was due to
TLB flushing.  Each time we modify the kernel pagetable we need to flush
the TLB.  For each hugetlb page that is optimized, there can be up to
two TLB flushes: one for the vmemmap pages associated with the
hugetlb page, and another if the vmemmap pages are mapped
at the PMD level and must be split.  The TLB flushes required for the kernel
pagetable result in a broadcast IPI, with each CPU having to flush a range
of pages or do a global flush if a threshold is exceeded.  So, the flush
time increases with the number of CPUs.  In addition, in virtual environments
the broadcast IPI can’t be accelerated by hypervisor hardware and leads to
traps that need to wake up/IPI all vCPUs, which is very expensive.  Because of
this, the slowdown in virtual environments is even worse than on bare metal as
the number of vCPUs/CPUs increases.
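
To put the flushing cost described above in rough numbers, the sketch below
counts worst-case flushes and the IPIs they would trigger.  It is a userspace
model, not kernel code: the function names are made up for illustration, it
assumes every page needs both a PMD split and a remap, and it assumes each
flush interrupts every other CPU.  For comparison it also counts flushes for
a scheme that flushes once per step per batch of pages.

```c
#include <stddef.h>

/*
 * Illustrative userspace model only; none of these names exist in the kernel.
 * Worst case when flushing per page: one flush for the PMD split plus one
 * flush for the vmemmap remap, for every hugetlb page.
 */
static size_t flushes_per_page(size_t nr_pages)
{
	return 2 * nr_pages;
}

/*
 * Flushing per batch: one flush after all PMD splits in a batch and one
 * after all remaps in that batch, regardless of the batch's page count.
 */
static size_t flushes_per_batch(size_t nr_pages, size_t batch_size)
{
	size_t nr_batches;

	if (nr_pages == 0 || batch_size == 0)
		return 0;
	nr_batches = (nr_pages + batch_size - 1) / batch_size; /* round up */
	return 2 * nr_batches;
}

/* Assumption: each flush sends one IPI to every CPU except the sender. */
static size_t ipis_for(size_t nr_flushes, size_t nr_cpus)
{
	if (nr_cpus < 2)
		return 0;
	return nr_flushes * (nr_cpus - 1);
}
```

Under these assumptions, 524288 pages flushed per page give up to 1048576
flushes, while the same pages handled as a single batch give 2.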

The following series attempts to reduce the amount of time spent in TLB flushing.
The idea is to batch the vmemmap modification operations for multiple hugetlb
pages.  Instead of doing one or two TLB flushes for each page, we do two TLB
flushes for each batch of pages.  One flush after splitting pages mapped at
the PMD level, and another after remapping vmemmap associated with all
hugetlb pages.  Results of such batching are as follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m5.273s

With batching, results are back in the 2-3x slowdown range.
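
The slowdown factors can be recomputed from the timings listed above.  The
helpers below are only a convenience sketch; `to_seconds` and `slowdown` are
names chosen here, not anything from the series.

```c
/* Convert a "XmY.YYYs" wall-clock reading into seconds. */
static double to_seconds(unsigned int minutes, double seconds)
{
	return 60.0 * minutes + seconds;
}

/* Slowdown of the run with optimization relative to the run without it. */
static double slowdown(double optimized_s, double unoptimized_s)
{
	return optimized_s / unoptimized_s;
}
```

Applied to the batched timings, this gives roughly 1.5x (allocate) and 3.1x
(free) on bare metal, and roughly 1.8x (allocate) and 1.7x (free) in the VM.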

This series is based on mm-unstable (September 24).

Changes v5 -> v6:
- patch 4 in bulk_vmemmap_restore_error remove folio from list before
  calling add_hugetlb_folio.
- Added Muchun RB for patches 2 and 3

Changes v4 -> v5:
- patch 3 comment style updated, unnecessary INIT_LIST_HEAD
- patch 4 updated hugetlb_vmemmap_restore_folios to pass back number of
  restored folios in non-error case.  In addition, routine passes back
  list of folios with vmemmap.  Naming more consistent.
- patch 5 removed over-optimization and added Muchun RB
- patch 6 break and early return in ENOMEM case.  Updated comments.
  Added Muchun RB.
- patch 7 Updated comments about splitting failure.  Added Muchun RB.
- patch 8 Made comments consistent.

Changes v3 -> v4:
- Rebased on mm-unstable and dropped requisite patches.
- patch 2 updated to take bootmem vmemmap initialization into account
- patch 3 more changes for bootmem hugetlb pages.  added routine
  prep_and_add_bootmem_folios.
- patch 5 in hugetlb_vmemmap_optimize_folios on ENOMEM check for
  list_empty before freeing and retry.  This is more important in
  subsequent patch where we flush_tlb_all after ENOMEM.

Changes v2 -> v3:
- patch 5 was part of an earlier series that was not picked up.  It is
  included here as it helps with batching optimizations.
- patch 6 hugetlb_vmemmap_restore_folios is changed from type void to
  returning an error code as well as an additional output parameter providing
  the number of folios for which vmemmap was actually restored.  The caller can
  then be more intelligent about processing the list.
- patch 9 eliminate local list in vmemmap_restore_pte.  The routine
  hugetlb_vmemmap_optimize_folios checks for ENOMEM and frees accumulated
  vmemmap pages while processing the list.
- patch 10 introduce flags field to struct vmemmap_remap_walk and
  VMEMMAP_SPLIT_NO_TLB_FLUSH for not flushing during pass to split PMDs.
- patch 11 rename flag VMEMMAP_REMAP_NO_TLB_FLUSH and pass in from callers.

Changes v1 -> v2:
- patch 5 now takes into account the requirement that only compound
  pages with hugetlb flag set can be passed to vmemmap routines.  This
  involved separating the 'prep' of hugetlb pages even further.  The code
  dealing with bootmem allocations was also modified so that batching is
  possible.  Adding a 'batch' of hugetlb pages to their respective free
  lists is now done in one lock cycle.
- patch 7 added description of routine hugetlb_vmemmap_restore_folios
  (Muchun).
- patch 8 rename bulk_pages to vmemmap_pages and let caller be responsible
  for freeing (Muchun)
- patch 9 use 'walk->remap_pte' to determine if a split only operation
  is being performed (Muchun).  Removed unused variable and
  hugetlb_optimize_vmemmap_key (Muchun).
- Patch 10 pass 'flags variable' instead of bool to indicate behavior and
  allow for future expansion (Muchun).  Single flag VMEMMAP_NO_TLB_FLUSH.
  Provide detailed comment about the need to keep old and new vmemmap pages
  in sync (Muchun).
- Patch 11 pass flag variable as in patch 10 (Muchun).

Joao Martins (2):
  hugetlb: batch PMD split for bulk vmemmap dedup
  hugetlb: batch TLB flushes when freeing vmemmap

Mike Kravetz (6):
  hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
  hugetlb: restructure pool allocations
  hugetlb: perform vmemmap optimization on a list of pages
  hugetlb: perform vmemmap restoration on a list of pages
  hugetlb: batch freeing of vmemmap pages
  hugetlb: batch TLB flushes when restoring vmemmap

 mm/hugetlb.c         | 301 ++++++++++++++++++++++++++++++++++++-------
 mm/hugetlb_vmemmap.c | 273 +++++++++++++++++++++++++++++++++------
 mm/hugetlb_vmemmap.h |  15 +++
 3 files changed, 506 insertions(+), 83 deletions(-)

-- 
2.41.0



* [PATCH v6 1/8] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 2/8] hugetlb: restructure pool allocations Mike Kravetz
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz, James Houghton

update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators.  This may require
allocating vmemmap pages associated with each hugetlb page.  The
hugetlb page destructor must be changed before pages are freed to lower
level allocators.  However, the destructor must be changed under the
hugetlb lock.  This means there is potentially one lock cycle per page.

Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on the list
3) freeing all pages on the list back to low level allocators
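
The three steps above can be modelled in userspace to see how the number of
lock cycles changes.  This is a simplified sketch, not the patch itself:
`struct model_folio`, `model_lock` and `model_free_bulk` are hypothetical
stand-ins, the lock is only a counter, and a failed restore is simulated with
a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugetlb folio; not the kernel's struct folio. */
struct model_folio {
	bool vmemmap_optimized;	/* vmemmap must be restored before freeing */
	bool restore_fails;	/* simulate an allocation failure in restore */
	bool on_list;		/* still queued for freeing */
	bool dtor_cleared;	/* destructor cleared under the lock */
};

static size_t lock_cycles;	/* counts simulated hugetlb_lock acquisitions */

static void model_lock(void) { lock_cycles++; }
static void model_unlock(void) { }

/* Returns the number of folios freed to the simulated low level allocator. */
static size_t model_free_bulk(struct model_folio *f, size_t n)
{
	bool clear_dtor = false;
	size_t i, freed = 0;

	/* Step 1: restore vmemmap; a failure puts the folio back in the pool. */
	for (i = 0; i < n; i++) {
		if (!f[i].vmemmap_optimized)
			continue;
		if (f[i].restore_fails) {
			model_lock();
			f[i].on_list = false;	/* back to the pool as surplus */
			model_unlock();
		} else {
			f[i].vmemmap_optimized = false;
			clear_dtor = true;
		}
	}

	/* Step 2: one lock cycle clears the destructor on what remains. */
	if (clear_dtor) {
		model_lock();
		for (i = 0; i < n; i++)
			if (f[i].on_list)
				f[i].dtor_cleared = true;
		model_unlock();
	}

	/* Step 3: free the remaining folios; this model takes no lock here. */
	for (i = 0; i < n; i++) {
		if (f[i].on_list) {
			f[i].on_list = false;
			freed++;
		}
	}
	return freed;
}
```

In this model the lock is taken once for the whole list, plus once for each
folio whose restore fails, instead of once per folio.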

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de220e3ff8be..47159b9de633 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1837,7 +1837,46 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
 static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 {
 	struct folio *folio, *t_folio;
+	bool clear_dtor = false;
 
+	/*
+	 * First allocate required vmemmmap (if necessary) for all folios on
+	 * list.  If vmemmap can not be allocated, we can not free folio to
+	 * lower level allocator, so add back as hugetlb surplus page.
+	 * add_hugetlb_folio() removes the page from THIS list.
+	 * Use clear_dtor to note if vmemmap was successfully allocated for
+	 * ANY page on the list.
+	 */
+	list_for_each_entry_safe(folio, t_folio, list, lru) {
+		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+			if (hugetlb_vmemmap_restore(h, &folio->page)) {
+				spin_lock_irq(&hugetlb_lock);
+				add_hugetlb_folio(h, folio, true);
+				spin_unlock_irq(&hugetlb_lock);
+			} else
+				clear_dtor = true;
+		}
+	}
+
+	/*
+	 * If vmemmmap allocation was performed on any folio above, take lock
+	 * to clear destructor of all folios on list.  This avoids the need to
+	 * lock/unlock for each individual folio.
+	 * The assumption is vmemmap allocation was performed on all or none
+	 * of the folios on the list.  This is true expect in VERY rare cases.
+	 */
+	if (clear_dtor) {
+		spin_lock_irq(&hugetlb_lock);
+		list_for_each_entry(folio, list, lru)
+			__clear_hugetlb_destructor(h, folio);
+		spin_unlock_irq(&hugetlb_lock);
+	}
+
+	/*
+	 * Free folios back to low level allocators.  vmemmap and destructors
+	 * were taken care of above, so update_and_free_hugetlb_folio will
+	 * not need to take hugetlb lock.
+	 */
 	list_for_each_entry_safe(folio, t_folio, list, lru) {
 		update_and_free_hugetlb_folio(h, folio, false);
 		cond_resched();
-- 
2.41.0



* [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 1/8] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-27 11:26   ` Konrad Dybcio
  2023-09-25 23:48 ` [PATCH v6 3/8] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Allocation of a hugetlb page for the hugetlb pool is done by the routine
alloc_pool_huge_page.  This routine will allocate contiguous pages from
a low level allocator, prep the pages for usage as a hugetlb page and
then add the resulting hugetlb page to the pool.

In the 'prep' stage, optional vmemmap optimization is done.  For
performance reasons we want to perform vmemmap optimization on multiple
hugetlb pages at once.  To do this, restructure the hugetlb pool
allocation code such that vmemmap optimization can be isolated and later
batched.

The code to allocate hugetlb pages from bootmem was also modified to
allow batching.

No functional changes, only code restructure.
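
The shape of the restructured allocation loop can be sketched in userspace as
follows.  All names here (`struct model_pool`, `model_alloc`,
`model_add_batch`, `model_grow_pool`) are hypothetical, the lock is just a
counter, and allocation failure is simulated with a budget.  The point is
that pages collected on a local list are not yet counted in the pool, so the
loop has to track them separately before adding them all at once.

```c
#include <stddef.h>

/* Hypothetical model of a hugetlb pool; not a kernel structure. */
struct model_pool {
	size_t persistent;	/* pages already in the pool */
	size_t lock_cycles;	/* simulated hugetlb_lock acquisitions */
	size_t budget;		/* allocations that succeed before failure */
};

/* Simulated low level allocation: succeeds while budget remains. */
static int model_alloc(struct model_pool *p)
{
	if (p->budget == 0)
		return 0;
	p->budget--;
	return 1;
}

/* Add every collected page to the pool in one simulated lock cycle. */
static void model_add_batch(struct model_pool *p, size_t nr)
{
	if (nr == 0)
		return;
	p->lock_cycles++;
	p->persistent += nr;
}

/* Grow the pool towards 'count'; returns the number of pages added. */
static size_t model_grow_pool(struct model_pool *p, size_t count)
{
	size_t allocated = 0;

	/* Collected pages are not in the pool yet, so count them separately. */
	while (count > p->persistent + allocated) {
		if (!model_alloc(p))
			break;
		allocated++;
	}
	model_add_batch(p, allocated);
	return allocated;
}
```

Unlike the real routine, this model skips the lock entirely when nothing was
allocated and does not handle signals or racing frees.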

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 179 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 140 insertions(+), 39 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 47159b9de633..64f50f3844fc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1970,16 +1970,21 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
 	h->nr_huge_pages_node[nid]++;
 }
 
-static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	folio_set_hugetlb(folio);
-	hugetlb_vmemmap_optimize(h, &folio->page);
 	INIT_LIST_HEAD(&folio->lru);
 	hugetlb_set_folio_subpool(folio, NULL);
 	set_hugetlb_cgroup(folio, NULL);
 	set_hugetlb_cgroup_rsvd(folio, NULL);
 }
 
+static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+{
+	init_new_hugetlb_folio(h, folio);
+	hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
 static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int nid)
 {
 	__prep_new_hugetlb_folio(h, folio);
@@ -2190,16 +2195,9 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	return page_folio(page);
 }
 
-/*
- * Common helper to allocate a fresh hugetlb page. All specific allocators
- * should use this function to get new hugetlb pages
- *
- * Note that returned page is 'frozen':  ref count of head page and all tail
- * pages is zero.
- */
-static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask,
-		nodemask_t *node_alloc_noretry)
+static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
+				gfp_t gfp_mask, int nid, nodemask_t *nmask,
+				nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
 	bool retry = false;
@@ -2212,6 +2210,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 				nid, nmask, node_alloc_noretry);
 	if (!folio)
 		return NULL;
+
 	if (hstate_is_gigantic(h)) {
 		if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
 			/*
@@ -2226,32 +2225,80 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 			return NULL;
 		}
 	}
-	prep_new_hugetlb_folio(h, folio, folio_nid(folio));
 
 	return folio;
 }
 
+static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
+{
+	struct folio *folio;
+
+	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+						node_alloc_noretry);
+	if (folio)
+		init_new_hugetlb_folio(h, folio);
+	return folio;
+}
+
 /*
- * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
- * manner.
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen':  ref count of head page and all tail
+ * pages is zero.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-				nodemask_t *node_alloc_noretry)
+static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
-	int nr_nodes, node;
+
+	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+						node_alloc_noretry);
+	if (!folio)
+		return NULL;
+
+	prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+	return folio;
+}
+
+static void prep_and_add_allocated_folios(struct hstate *h,
+					struct list_head *folio_list)
+{
+	struct folio *folio, *tmp_f;
+
+	/* Add all new pool pages to free lists in one lock cycle */
+	spin_lock_irq(&hugetlb_lock);
+	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+		__prep_account_new_huge_page(h, folio_nid(folio));
+		enqueue_hugetlb_folio(h, folio);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+/*
+ * Allocates a fresh hugetlb page in a node interleaved manner.  The page
+ * will later be added to the appropriate hugetlb pool.
+ */
+static struct folio *alloc_pool_huge_folio(struct hstate *h,
+					nodemask_t *nodes_allowed,
+					nodemask_t *node_alloc_noretry)
+{
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nr_nodes, node;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
+		struct folio *folio;
+
+		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
 					nodes_allowed, node_alloc_noretry);
-		if (folio) {
-			free_huge_folio(folio); /* free it into the hugepage allocator */
-			return 1;
-		}
+		if (folio)
+			return folio;
 	}
 
-	return 0;
+	return NULL;
 }
 
 /*
@@ -3264,25 +3311,35 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
  */
 static void __init gather_bootmem_prealloc(void)
 {
+	LIST_HEAD(folio_list);
 	struct huge_bootmem_page *m;
+	struct hstate *h, *prev_h = NULL;
 
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
-		struct hstate *h = m->hstate;
+
+		h = m->hstate;
+		/*
+		 * It is possible to have multiple huge page sizes (hstates)
+		 * in this list.  If so, process each size separately.
+		 */
+		if (h != prev_h && prev_h != NULL)
+			prep_and_add_allocated_folios(prev_h, &folio_list);
+		prev_h = h;
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(folio_ref_count(folio) != 1);
 
 		hugetlb_folio_init_vmemmap(folio, h,
 					   HUGETLB_VMEMMAP_RESERVE_PAGES);
-		prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+		__prep_new_hugetlb_folio(h, folio);
 		/* If HVO fails, initialize all tail struct pages */
 		if (!HPageVmemmapOptimized(&folio->page))
 			hugetlb_folio_init_tail_vmemmap(folio,
 						HUGETLB_VMEMMAP_RESERVE_PAGES,
 						pages_per_huge_page(h));
-		free_huge_folio(folio); /* add to the hugepage allocator */
+		list_add(&folio->lru, &folio_list);
 
 		/*
 		 * We need to restore the 'stolen' pages to totalram_pages
@@ -3292,6 +3349,8 @@ static void __init gather_bootmem_prealloc(void)
 		adjust_managed_page_count(page, pages_per_huge_page(h));
 		cond_resched();
 	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
 }
 
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
@@ -3325,9 +3384,22 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 	h->max_huge_pages_node[nid] = i;
 }
 
+/*
+ * NOTE: this routine is called in different contexts for gigantic and
+ * non-gigantic pages.
+ * - For gigantic pages, this is called early in the boot process and
+ *   pages are allocated from memblock allocated or something similar.
+ *   Gigantic pages are actually added to pools later with the routine
+ *   gather_bootmem_prealloc.
+ * - For non-gigantic pages, this is called later in the boot process after
+ *   all of mm is up and functional.  Pages are allocated from buddy and
+ *   then added to hugetlb pools.
+ */
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	struct folio *folio;
+	LIST_HEAD(folio_list);
 	nodemask_t *node_alloc_noretry;
 	bool node_specific_alloc = false;
 
@@ -3369,14 +3441,25 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
+			/*
+			 * gigantic pages not added to list as they are not
+			 * added to pools now.
+			 */
 			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
 				break;
-		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY],
-					 node_alloc_noretry))
-			break;
+		} else {
+			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+							node_alloc_noretry);
+			if (!folio)
+				break;
+			list_add(&folio->lru, &folio_list);
+		}
 		cond_resched();
 	}
+
+	/* list will be empty if hstate_is_gigantic */
+	prep_and_add_allocated_folios(h, &folio_list);
+
 	if (i < h->max_huge_pages) {
 		char buf[32];
 
@@ -3510,7 +3593,9 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
-	unsigned long min_count, ret;
+	unsigned long min_count;
+	unsigned long allocated;
+	struct folio *folio;
 	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
@@ -3587,7 +3672,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			break;
 	}
 
-	while (count > persistent_huge_pages(h)) {
+	allocated = 0;
+	while (count > (persistent_huge_pages(h) + allocated)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_folio will handle it by freeing the page
@@ -3598,15 +3684,32 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed,
+		folio = alloc_pool_huge_folio(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock_irq(&hugetlb_lock);
-		if (!ret)
+		if (!folio) {
+			prep_and_add_allocated_folios(h, &page_list);
+			spin_lock_irq(&hugetlb_lock);
 			goto out;
+		}
+
+		list_add(&folio->lru, &page_list);
+		allocated++;
 
 		/* Bail for signals. Probably ctrl-c from user */
-		if (signal_pending(current))
+		if (signal_pending(current)) {
+			prep_and_add_allocated_folios(h, &page_list);
+			spin_lock_irq(&hugetlb_lock);
 			goto out;
+		}
+
+		spin_lock_irq(&hugetlb_lock);
+	}
+
+	/* Add allocated pages to the pool */
+	if (!list_empty(&page_list)) {
+		spin_unlock_irq(&hugetlb_lock);
+		prep_and_add_allocated_folios(h, &page_list);
+		spin_lock_irq(&hugetlb_lock);
 	}
 
 	/*
@@ -3632,8 +3735,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * Collect pages to be removed on list without dropping lock
 	 */
 	while (min_count < persistent_huge_pages(h)) {
-		struct folio *folio;
-
 		folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
 		if (!folio)
 			break;
-- 
2.41.0



* [PATCH v6 3/8] hugetlb: perform vmemmap optimization on a list of pages
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 1/8] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 2/8] hugetlb: restructure pool allocations Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 4/8] hugetlb: perform vmemmap restoration " Mike Kravetz
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

When adding hugetlb pages to the pool, we first create a list of the
allocated pages before adding to the pool.  Pass this list of pages to a
new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.

Due to significant differences in vmemmap initialization for bootmem
allocated hugetlb pages, a new routine prep_and_add_bootmem_folios
is created.

We also modify the routine vmemmap_should_optimize() to check for pages
that are already optimized.  There are code paths that might request
vmemmap optimization twice and we want to make sure this is not
attempted.
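
The effect of the extra check can be illustrated with a small userspace
model.  The names below (`struct model_folio`, `model_should_optimize`,
`model_optimize`, `model_optimize_folios`) are hypothetical and the actual
remapping work is reduced to a counter.  The sketch only shows that a second
request on an already optimized folio becomes a no-op.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugetlb folio; not the kernel's struct folio. */
struct model_folio {
	bool optimized;			/* models the vmemmap optimized flag */
	unsigned int optimize_calls;	/* times the real work would have run */
};

static bool model_optimize_enabled = true;	/* models the sysctl setting */

static bool model_should_optimize(const struct model_folio *f)
{
	if (f->optimized)		/* already done, never do it twice */
		return false;
	if (!model_optimize_enabled)
		return false;
	return true;
}

static void model_optimize(struct model_folio *f)
{
	if (!model_should_optimize(f))
		return;
	f->optimize_calls++;		/* stands in for the vmemmap remap */
	f->optimized = true;
}

/* Apply the per-folio operation to every folio in the batch. */
static void model_optimize_folios(struct model_folio *f, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_optimize(&f[i]);
}
```

Calling `model_optimize_folios` twice on the same array leaves
`optimize_calls` at 1 for folios that started unoptimized and at 0 for folios
that were already optimized.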

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         | 42 ++++++++++++++++++++++++++++++++++--------
 mm/hugetlb_vmemmap.c | 11 +++++++++++
 mm/hugetlb_vmemmap.h |  5 +++++
 3 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64f50f3844fc..da0ebd370b5f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2269,6 +2269,9 @@ static void prep_and_add_allocated_folios(struct hstate *h,
 {
 	struct folio *folio, *tmp_f;
 
+	/* Send list for bulk vmemmap optimization processing */
+	hugetlb_vmemmap_optimize_folios(h, folio_list);
+
 	/* Add all new pool pages to free lists in one lock cycle */
 	spin_lock_irq(&hugetlb_lock);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
@@ -3305,6 +3308,34 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
 	prep_compound_head((struct page *)folio, huge_page_order(h));
 }
 
+static void __init prep_and_add_bootmem_folios(struct hstate *h,
+					struct list_head *folio_list)
+{
+	struct folio *folio, *tmp_f;
+
+	/* Send list for bulk vmemmap optimization processing */
+	hugetlb_vmemmap_optimize_folios(h, folio_list);
+
+	/* Add all new pool pages to free lists in one lock cycle */
+	spin_lock_irq(&hugetlb_lock);
+	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
+			/*
+			 * If HVO fails, initialize all tail struct pages
+			 * We do not worry about potential long lock hold
+			 * time as this is early in boot and there should
+			 * be no contention.
+			 */
+			hugetlb_folio_init_tail_vmemmap(folio,
+					HUGETLB_VMEMMAP_RESERVE_PAGES,
+					pages_per_huge_page(h));
+		}
+		__prep_account_new_huge_page(h, folio_nid(folio));
+		enqueue_hugetlb_folio(h, folio);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+}
+
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_ORDER) pages.
@@ -3325,7 +3356,7 @@ static void __init gather_bootmem_prealloc(void)
 		 * in this list.  If so, process each size separately.
 		 */
 		if (h != prev_h && prev_h != NULL)
-			prep_and_add_allocated_folios(prev_h, &folio_list);
+			prep_and_add_bootmem_folios(prev_h, &folio_list);
 		prev_h = h;
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
@@ -3333,12 +3364,7 @@ static void __init gather_bootmem_prealloc(void)
 
 		hugetlb_folio_init_vmemmap(folio, h,
 					   HUGETLB_VMEMMAP_RESERVE_PAGES);
-		__prep_new_hugetlb_folio(h, folio);
-		/* If HVO fails, initialize all tail struct pages */
-		if (!HPageVmemmapOptimized(&folio->page))
-			hugetlb_folio_init_tail_vmemmap(folio,
-						HUGETLB_VMEMMAP_RESERVE_PAGES,
-						pages_per_huge_page(h));
+		init_new_hugetlb_folio(h, folio);
 		list_add(&folio->lru, &folio_list);
 
 		/*
@@ -3350,7 +3376,7 @@ static void __init gather_bootmem_prealloc(void)
 		cond_resched();
 	}
 
-	prep_and_add_allocated_folios(h, &folio_list);
+	prep_and_add_bootmem_folios(h, &folio_list);
 }
 
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 76682d1d79a7..4558b814ffab 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -483,6 +483,9 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
 static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
 {
+	if (HPageVmemmapOptimized((struct page *)head))
+		return false;
+
 	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return false;
 
@@ -572,6 +575,14 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 		SetHPageVmemmapOptimized(head);
 }
 
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+	struct folio *folio;
+
+	list_for_each_entry(folio, folio_list, lru)
+		hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
 		.procname	= "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 4573899855d7..c512e388dbb4 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -20,6 +20,7 @@
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
 void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
 {
@@ -48,6 +49,10 @@ static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page
 {
 }
 
+static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+}
+
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
 	return 0;
-- 
2.41.0



* [PATCH v6 4/8] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (2 preceding siblings ...)
  2023-09-25 23:48 ` [PATCH v6 3/8] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-26  2:27   ` Muchun Song
  2023-09-29 22:10   ` Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 5/8] hugetlb: batch freeing of vmemmap pages Mike Kravetz
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

The routine update_and_free_pages_bulk already performs vmemmap
restoration on the list of hugetlb pages in a separate step.  In
preparation for more functionality to be added in this step, create a
new routine hugetlb_vmemmap_restore_folios() that will restore
vmemmap for a list of folios.

This new routine must provide sufficient feedback about errors and
actual restoration performed so that update_and_free_pages_bulk can
perform optimally.

Special care must be taken when encountering an error from
hugetlb_vmemmap_restore_folios.  We want to continue making as much
forward progress as possible.  A new routine bulk_vmemmap_restore_error
handles this specific situation.
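
The retry strategy described above can be sketched in user space.  This is
a toy model only, not the kernel code: every toy_* name, the page pool,
and the per-folio costs are hypothetical, and arrays with a state field
stand in for the kernel's folio lists.  It assumes that freeing a folio
which already has vmemmap returns enough memory to restore others.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_state {
	TOY_PENDING,	/* still on the input list */
	TOY_READY,	/* has vmemmap; on the output list */
	TOY_FREED,	/* returned to the allocator */
	TOY_SURPLUS,	/* could not be restored; kept as surplus */
};

struct toy_folio {
	int optimized;		/* nonzero: vmemmap must be restored first */
	enum toy_state state;
};

struct toy_pool {
	long free_pages;	/* pages available for vmemmap allocation */
	long restore_cost;	/* pages needed to restore one folio */
	long folio_pages;	/* pages released when one folio is freed */
};

/* Restore vmemmap for one folio, or fail with -ENOMEM. */
int toy_restore(struct toy_pool *p, struct toy_folio *f)
{
	if (p->free_pages < p->restore_cost)
		return -ENOMEM;
	p->free_pages -= p->restore_cost;
	f->optimized = 0;
	return 0;
}

/*
 * Walk pending folios in order.  Returns the number restored, or a
 * negative errno at the first failure; the failing folio and all
 * later ones stay pending.
 */
long toy_restore_folios(struct toy_pool *p, struct toy_folio *v, size_t n)
{
	long restored = 0;

	for (size_t i = 0; i < n; i++) {
		if (v[i].state != TOY_PENDING)
			continue;
		if (v[i].optimized) {
			int ret = toy_restore(p, &v[i]);

			if (ret)
				return ret;
			restored++;
		}
		v[i].state = TOY_READY;
	}
	return restored;
}

/* Free every folio that has vmemmap, returning its pages to the pool. */
size_t toy_free_ready(struct toy_pool *p, struct toy_folio *v, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (v[i].state != TOY_READY)
			continue;
		assert(!v[i].optimized);
		v[i].state = TOY_FREED;
		p->free_pages += p->folio_pages;
		freed++;
	}
	return freed;
}

/* Bulk free with retry.  Returns the number of folios freed. */
size_t toy_free_bulk(struct toy_pool *p, struct toy_folio *v, size_t n)
{
	size_t freed = 0;

	while (toy_restore_folios(p, v, n) < 0) {
		size_t now = toy_free_ready(p, v, n);

		freed += now;
		if (now)
			continue;	/* memory was released; retry */
		/* Nothing to free: keep the failing folio as surplus. */
		for (size_t i = 0; i < n; i++) {
			if (v[i].state == TOY_PENDING) {
				v[i].state = TOY_SURPLUS;
				break;
			}
		}
	}
	return freed + toy_free_ready(p, v, n);
}
```

Each failed pass either frees at least one folio or retires one pending
folio as surplus, so the loop terminates.  With illustrative values of
7 pages to restore and 512 pages released per folio, a pool of 7 pages
is enough to free four optimized folios: the first restore succeeds,
the second fails, and freeing the first folio funds the rest.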

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c         | 99 +++++++++++++++++++++++++++++++-------------
 mm/hugetlb_vmemmap.c | 38 +++++++++++++++++
 mm/hugetlb_vmemmap.h | 10 +++++
 3 files changed, 119 insertions(+), 28 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index da0ebd370b5f..c484bb74201a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1834,50 +1834,93 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
 		schedule_work(&free_hpage_work);
 }
 
-static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+static void bulk_vmemmap_restore_error(struct hstate *h,
+					struct list_head *folio_list,
+					struct list_head *non_hvo_folios)
 {
 	struct folio *folio, *t_folio;
-	bool clear_dtor = false;
 
-	/*
-	 * First allocate required vmemmmap (if necessary) for all folios on
-	 * list.  If vmemmap can not be allocated, we can not free folio to
-	 * lower level allocator, so add back as hugetlb surplus page.
-	 * add_hugetlb_folio() removes the page from THIS list.
-	 * Use clear_dtor to note if vmemmap was successfully allocated for
-	 * ANY page on the list.
-	 */
-	list_for_each_entry_safe(folio, t_folio, list, lru) {
-		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+	if (!list_empty(non_hvo_folios)) {
+		/*
+		 * Free any restored hugetlb pages so that restore of the
+		 * entire list can be retried.
+		 * The idea is that in the common case of ENOMEM errors freeing
+		 * hugetlb pages with vmemmap we will free up memory so that we
+		 * can allocate vmemmap for more hugetlb pages.
+		 */
+		list_for_each_entry_safe(folio, t_folio, non_hvo_folios, lru) {
+			list_del(&folio->lru);
+			spin_lock_irq(&hugetlb_lock);
+			__clear_hugetlb_destructor(h, folio);
+			spin_unlock_irq(&hugetlb_lock);
+			update_and_free_hugetlb_folio(h, folio, false);
+			cond_resched();
+		}
+	} else {
+		/*
+		 * In the case where there are no folios which can be
+		 * immediately freed, we loop through the list trying to restore
+		 * vmemmap individually in the hope that someone elsewhere may
+		 * have done something to cause success (such as freeing some
+		 * memory).  If unable to restore a hugetlb page, the hugetlb
+		 * page is made a surplus page and removed from the list.
+		 * If we are able to restore vmemmap and free one hugetlb page,
+		 * we quit processing the list to retry the bulk operation.
+		 */
+		list_for_each_entry_safe(folio, t_folio, folio_list, lru)
 			if (hugetlb_vmemmap_restore(h, &folio->page)) {
+				list_del(&folio->lru);
 				spin_lock_irq(&hugetlb_lock);
 				add_hugetlb_folio(h, folio, true);
 				spin_unlock_irq(&hugetlb_lock);
-			} else
-				clear_dtor = true;
-		}
+			} else {
+				list_del(&folio->lru);
+				spin_lock_irq(&hugetlb_lock);
+				__clear_hugetlb_destructor(h, folio);
+				spin_unlock_irq(&hugetlb_lock);
+				update_and_free_hugetlb_folio(h, folio, false);
+				cond_resched();
+				break;
+			}
 	}
+}
+
+static void update_and_free_pages_bulk(struct hstate *h,
+						struct list_head *folio_list)
+{
+	long ret;
+	struct folio *folio, *t_folio;
+	LIST_HEAD(non_hvo_folios);
 
 	/*
-	 * If vmemmmap allocation was performed on any folio above, take lock
-	 * to clear destructor of all folios on list.  This avoids the need to
-	 * lock/unlock for each individual folio.
-	 * The assumption is vmemmap allocation was performed on all or none
-	 * of the folios on the list.  This is true expect in VERY rare cases.
+	 * First allocate required vmemmap (if necessary) for all folios.
+	 * Carefully handle errors and free up any available hugetlb pages
+	 * in an effort to make forward progress.
 	 */
-	if (clear_dtor) {
+retry:
+	ret = hugetlb_vmemmap_restore_folios(h, folio_list, &non_hvo_folios);
+	if (ret < 0) {
+		bulk_vmemmap_restore_error(h, folio_list, &non_hvo_folios);
+		goto retry;
+	}
+
+	/*
+	 * At this point, list should be empty, ret should be >= 0 and there
+	 * should only be pages on the non_hvo_folios list.
+	 * Do note that the non_hvo_folios list could be empty.
+	 * Without HVO enabled, ret will be 0 and there is no need to call
+	 * __clear_hugetlb_destructor as this was done previously.
+	 */
+	VM_WARN_ON(!list_empty(folio_list));
+	VM_WARN_ON(ret < 0);
+	if (!list_empty(&non_hvo_folios) && ret) {
 		spin_lock_irq(&hugetlb_lock);
-		list_for_each_entry(folio, list, lru)
+		list_for_each_entry(folio, &non_hvo_folios, lru)
 			__clear_hugetlb_destructor(h, folio);
 		spin_unlock_irq(&hugetlb_lock);
 	}
 
-	/*
-	 * Free folios back to low level allocators.  vmemmap and destructors
-	 * were taken care of above, so update_and_free_hugetlb_folio will
-	 * not need to take hugetlb lock.
-	 */
-	list_for_each_entry_safe(folio, t_folio, list, lru) {
+	list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
 		update_and_free_hugetlb_folio(h, folio, false);
 		cond_resched();
 	}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4558b814ffab..77f44b81ff01 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -480,6 +480,44 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	return ret;
 }
 
+/**
+ * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
+ * @h:			hstate.
+ * @folio_list:		list of folios.
+ * @non_hvo_folios:	Output list of folios for which vmemmap exists.
+ *
+ * Return: number of folios for which vmemmap was restored, or an error code
+ *		if an error was encountered restoring vmemmap for a folio.
+ *		Folios that have vmemmap are moved to the non_hvo_folios
+ *		list.  Processing of entries stops when the first error is
+ *		encountered. The folio that experienced the error and all
+ *		non-processed folios will remain on folio_list.
+ */
+long hugetlb_vmemmap_restore_folios(const struct hstate *h,
+					struct list_head *folio_list,
+					struct list_head *non_hvo_folios)
+{
+	struct folio *folio, *t_folio;
+	long restored = 0;
+	long ret = 0;
+
+	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
+		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+			ret = hugetlb_vmemmap_restore(h, &folio->page);
+			if (ret)
+				break;
+			restored++;
+		}
+
+		/* Add non-optimized folios to output list */
+		list_move(&folio->lru, non_hvo_folios);
+	}
+
+	if (!ret)
+		ret = restored;
+	return ret;
+}
+
 /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
 static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
 {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index c512e388dbb4..0b7710f90e38 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -19,6 +19,9 @@
 
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+long hugetlb_vmemmap_restore_folios(const struct hstate *h,
+					struct list_head *folio_list,
+					struct list_head *non_hvo_folios);
 void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 
@@ -45,6 +48,13 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
 	return 0;
 }
 
+static inline long hugetlb_vmemmap_restore_folios(const struct hstate *h,
+					struct list_head *folio_list,
+					struct list_head *non_hvo_folios)
+{
+	return 0;
+}
+
 static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 {
 }
-- 
2.41.0



* [PATCH v6 5/8] hugetlb: batch freeing of vmemmap pages
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (3 preceding siblings ...)
  2023-09-25 23:48 ` [PATCH v6 4/8] hugetlb: perform vmemmap restoration " Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 6/8] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Now that batching of hugetlb vmemmap optimization processing is possible,
batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
hugetlb page, we add them to a list that is freed after the entire batch
has been processed.

This enhances the ability to return contiguous ranges of memory to the
low level allocators.
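
A minimal user-space sketch of the deferred-free idea follows.  It is an
illustration under stated assumptions, not the kernel implementation:
the toy_* names are hypothetical, a counter stands in for the list of
vmemmap pages, and each optimization is assumed to need one page up
front before it queues `gain` pages for freeing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_batch {
	long pool;		/* pages currently available to allocate */
	long pending;		/* pages queued for freeing, not yet in pool */
	long free_calls;	/* how many times the queue was drained */
};

/* Optimize one folio: consume one page, queue `gain` pages for freeing. */
int toy_optimize_one(struct toy_batch *b, long gain)
{
	if (b->pool < 1)
		return -ENOMEM;
	b->pool -= 1;
	b->pending += gain;
	return 0;
}

/* Release everything queued so far back to the pool. */
void toy_drain(struct toy_batch *b)
{
	b->pool += b->pending;
	b->pending = 0;
	b->free_calls++;
}

/* Optimize a batch, freeing once at the end unless ENOMEM forces it. */
long toy_optimize_batch(struct toy_batch *b, size_t nfolios, long gain)
{
	long optimized = 0;

	for (size_t i = 0; i < nfolios; i++) {
		int ret = toy_optimize_one(b, gain);

		/* Out of memory with pages queued: free them and retry once. */
		if (ret == -ENOMEM && b->pending > 0) {
			toy_drain(b);
			ret = toy_optimize_one(b, gain);
		}
		if (!ret)
			optimized++;
	}

	toy_drain(b);
	assert(b->pending == 0);
	return optimized;
}
```

Starting with a pool of 2 pages and a gain of 7 pages per folio, a batch
of 5 folios hits ENOMEM on the third folio, drains the queue early, and
then completes, for two drains in total instead of five.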

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 82 ++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 77f44b81ff01..4ac521e596db 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -251,7 +251,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	}
 
 	entry = mk_pte(walk->reuse_page, pgprot);
-	list_add_tail(&page->lru, walk->vmemmap_pages);
+	list_add(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
 
@@ -306,18 +306,20 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
  * @reuse:	reuse address.
+ * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
+ *		responsibility to free pages.
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
-			      unsigned long reuse)
+			      unsigned long reuse,
+			      struct list_head *vmemmap_pages)
 {
 	int ret;
-	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_remap_pte,
 		.reuse_addr	= reuse,
-		.vmemmap_pages	= &vmemmap_pages,
+		.vmemmap_pages	= vmemmap_pages,
 	};
 	int nid = page_to_nid((struct page *)reuse);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -334,7 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	if (walk.reuse_page) {
 		copy_page(page_to_virt(walk.reuse_page),
 			  (void *)walk.reuse_addr);
-		list_add(&walk.reuse_page->lru, &vmemmap_pages);
+		list_add(&walk.reuse_page->lru, vmemmap_pages);
 	}
 
 	/*
@@ -365,15 +367,13 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		walk = (struct vmemmap_remap_walk) {
 			.remap_pte	= vmemmap_restore_pte,
 			.reuse_addr	= reuse,
-			.vmemmap_pages	= &vmemmap_pages,
+			.vmemmap_pages	= vmemmap_pages,
 		};
 
 		vmemmap_remap_range(reuse, end, &walk);
 	}
 	mmap_read_unlock(&init_mm);
 
-	free_vmemmap_page_list(&vmemmap_pages);
-
 	return ret;
 }
 
@@ -389,7 +389,7 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
 		page = alloc_pages_node(nid, gfp_mask, 0);
 		if (!page)
 			goto out;
-		list_add_tail(&page->lru, list);
+		list_add(&page->lru, list);
 	}
 
 	return 0;
@@ -577,24 +577,17 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
 	return true;
 }
 
-/**
- * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
- * @h:		struct hstate.
- * @head:	the head page whose vmemmap pages will be optimized.
- *
- * This function only tries to optimize @head's vmemmap pages and does not
- * guarantee that the optimization will succeed after it returns. The caller
- * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
- * have been optimized.
- */
-void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+static int __hugetlb_vmemmap_optimize(const struct hstate *h,
+					struct page *head,
+					struct list_head *vmemmap_pages)
 {
+	int ret = 0;
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
 	unsigned long vmemmap_reuse;
 
 	VM_WARN_ON_ONCE(!PageHuge(head));
 	if (!vmemmap_should_optimize(h, head))
-		return;
+		return ret;
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
 
@@ -604,21 +597,58 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 
 	/*
 	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
-	 * to the page which @vmemmap_reuse is mapped to, then free the pages
-	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
+	 * to the page which @vmemmap_reuse is mapped to.  Add pages previously
+	 * mapping the range to vmemmap_pages list so that they can be freed by
+	 * the caller.
 	 */
-	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
+	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages);
+	if (ret)
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
 	else
 		SetHPageVmemmapOptimized(head);
+
+	return ret;
+}
+
+/**
+ * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
+ * @h:		struct hstate.
+ * @head:	the head page whose vmemmap pages will be optimized.
+ *
+ * This function only tries to optimize @head's vmemmap pages and does not
+ * guarantee that the optimization will succeed after it returns. The caller
+ * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
+ * have been optimized.
+ */
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+{
+	LIST_HEAD(vmemmap_pages);
+
+	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
+	free_vmemmap_page_list(&vmemmap_pages);
 }
 
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 {
 	struct folio *folio;
+	LIST_HEAD(vmemmap_pages);
+
+	list_for_each_entry(folio, folio_list, lru) {
+		int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
+								&vmemmap_pages);
+
+		/*
+		 * Pages to be freed may have been accumulated.  If we
+		 * encounter an ENOMEM,  free what we have and try again.
+		 */
+		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
+			free_vmemmap_page_list(&vmemmap_pages);
+			INIT_LIST_HEAD(&vmemmap_pages);
+			__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
+		}
+	}
 
-	list_for_each_entry(folio, folio_list, lru)
-		hugetlb_vmemmap_optimize(h, &folio->page);
+	free_vmemmap_page_list(&vmemmap_pages);
 }
 
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
-- 
2.41.0



* [PATCH v6 6/8] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (4 preceding siblings ...)
  2023-09-25 23:48 ` [PATCH v6 5/8] hugetlb: batch freeing of vmemmap pages Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 7/8] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 8/8] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

From: Joao Martins <joao.m.martins@oracle.com>

In an effort to minimize the number of TLB flushes, batch all PMD splits
belonging to a range of pages in order to perform only 1 (global) TLB
flush.

Add a flags field to the walker and pass whether it's a bulk allocation
or just a single page to decide to remap. First value
(VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
flush when we split the PMD.
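
The effect of such a flag can be illustrated with a small user-space
sketch.  It only counts flushes and is not the kernel code: the toy_*
names and TOY_SPLIT_NO_TLB_FLUSH are hypothetical stand-ins, and a
counter replaces the real flush calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogue of a "skip the flush on split" flag bit. */
#define TOY_SPLIT_NO_TLB_FLUSH	(1UL << 0)

struct toy_walk {
	unsigned long flags;	/* modifies behavior of the walk */
	unsigned long flushes;	/* number of flushes issued so far */
};

/* Split one entry; flush immediately unless the caller opted out. */
void toy_split_one(struct toy_walk *w)
{
	if (!(w->flags & TOY_SPLIT_NO_TLB_FLUSH))
		w->flushes++;
}

/* Split n entries and return the total number of flushes issued. */
unsigned long toy_split_range(size_t n, unsigned long flags)
{
	struct toy_walk w = { .flags = flags, .flushes = 0 };

	for (size_t i = 0; i < n; i++)
		toy_split_one(&w);

	/* The batching caller owes a single global flush at the end. */
	if (flags & TOY_SPLIT_NO_TLB_FLUSH)
		w.flushes++;

	assert(w.flushes <= n + 1);
	return w.flushes;
}
```

With the flag clear, 512 splits cost 512 flushes; with it set, they
cost one.  The trade-off is that the single flush is global rather
than limited to the address range that changed.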

Rebased and updated by Mike Kravetz

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 92 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 88 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4ac521e596db..10739e4285d5 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -27,6 +27,8 @@
  * @reuse_addr:		the virtual address of the @reuse_page page.
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
  *			or is mapped from.
+ * @flags:		used to modify behavior in vmemmap page table walking
+ *			operations.
  */
 struct vmemmap_remap_walk {
 	void			(*remap_pte)(pte_t *pte, unsigned long addr,
@@ -35,9 +37,13 @@ struct vmemmap_remap_walk {
 	struct page		*reuse_page;
 	unsigned long		reuse_addr;
 	struct list_head	*vmemmap_pages;
+
+/* Skip the TLB flush when we split the PMD */
+#define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
+	unsigned long		flags;
 };
 
-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
 {
 	pmd_t __pmd;
 	int i;
@@ -80,7 +86,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
 		/* Make pte visible before pmd. See comment in pmd_install(). */
 		smp_wmb();
 		pmd_populate_kernel(&init_mm, pmd, pgtable);
-		flush_tlb_kernel_range(start, start + PMD_SIZE);
+		if (flush)
+			flush_tlb_kernel_range(start, start + PMD_SIZE);
 	} else {
 		pte_free_kernel(&init_mm, pgtable);
 	}
@@ -127,11 +134,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
 	do {
 		int ret;
 
-		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
+		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
+				!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
 		if (ret)
 			return ret;
 
 		next = pmd_addr_end(addr, end);
+
+		/*
+		 * We are only splitting, not remapping the hugetlb vmemmap
+		 * pages.
+		 */
+		if (!walk->remap_pte)
+			continue;
+
 		vmemmap_pte_range(pmd, addr, next, walk);
 	} while (pmd++, addr = next, addr != end);
 
@@ -198,7 +214,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 			return ret;
 	} while (pgd++, addr = next, addr != end);
 
-	flush_tlb_kernel_range(start, end);
+	if (walk->remap_pte)
+		flush_tlb_kernel_range(start, end);
 
 	return 0;
 }
@@ -297,6 +314,36 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
 }
 
+/**
+ * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
+ *                      backing PMDs of the directmap into PTEs
+ * @start:     start address of the vmemmap virtual address range that we want
+ *             to remap.
+ * @end:       end address of the vmemmap virtual address range that we want to
+ *             remap.
+ * @reuse:     reuse address.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+static int vmemmap_remap_split(unsigned long start, unsigned long end,
+				unsigned long reuse)
+{
+	int ret;
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= NULL,
+		.flags		= VMEMMAP_SPLIT_NO_TLB_FLUSH,
+	};
+
+	/* See the comment in the vmemmap_remap_free(). */
+	BUG_ON(start - reuse != PAGE_SIZE);
+
+	mmap_read_lock(&init_mm);
+	ret = vmemmap_remap_range(reuse, end, &walk);
+	mmap_read_unlock(&init_mm);
+
+	return ret;
+}
+
 /**
  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
  *			to the page which @reuse is mapped to, then free vmemmap
@@ -320,6 +367,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.remap_pte	= vmemmap_remap_pte,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= vmemmap_pages,
+		.flags		= 0,
 	};
 	int nid = page_to_nid((struct page *)reuse);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -368,6 +416,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 			.remap_pte	= vmemmap_restore_pte,
 			.reuse_addr	= reuse,
 			.vmemmap_pages	= vmemmap_pages,
+			.flags		= 0,
 		};
 
 		vmemmap_remap_range(reuse, end, &walk);
@@ -419,6 +468,7 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 		.remap_pte	= vmemmap_restore_pte,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
+		.flags		= 0,
 	};
 
 	/* See the comment in the vmemmap_remap_free(). */
@@ -628,11 +678,45 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
+static int hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
+{
+	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
+	unsigned long vmemmap_reuse;
+
+	if (!vmemmap_should_optimize(h, head))
+		return 0;
+
+	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
+	vmemmap_reuse	= vmemmap_start;
+	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
+
+	/*
+	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
+	 * @vmemmap_end]
+	 */
+	return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+}
+
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 {
 	struct folio *folio;
 	LIST_HEAD(vmemmap_pages);
 
+	list_for_each_entry(folio, folio_list, lru) {
+		int ret = hugetlb_vmemmap_split(h, &folio->page);
+
+		/*
+		 * Splitting the PMD requires allocating a page, thus let's fail
+		 * early once we encounter the first OOM. No point in retrying
+		 * as it can be dynamically done on remap with the memory
+		 * we get back from the vmemmap deduplication.
+		 */
+		if (ret == -ENOMEM)
+			break;
+	}
+
+	flush_tlb_all();
+
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
 								&vmemmap_pages);
-- 
2.41.0



* [PATCH v6 7/8] hugetlb: batch TLB flushes when freeing vmemmap
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (5 preceding siblings ...)
  2023-09-25 23:48 ` [PATCH v6 6/8] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-25 23:48 ` [PATCH v6 8/8] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
  7 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

From: Joao Martins <joao.m.martins@oracle.com>

Now that a list of pages is deduplicated at once, the TLB
flush can be batched for all vmemmap pages that got remapped.

Expand the flags field value to pass whether to skip the TLB flush
on remap of the PTE.

The TLB flush is global because the caller does not guarantee that the
set of folios is contiguous, and composing a list of kVAs to flush
would add complexity.

Modified by Mike Kravetz to perform TLB flush on single folio if an
error is encountered.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 49 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 10739e4285d5..9df350372046 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -40,6 +40,8 @@ struct vmemmap_remap_walk {
 
 /* Skip the TLB flush when we split the PMD */
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
+/* Skip the TLB flush when we remap the PTE */
+#define VMEMMAP_REMAP_NO_TLB_FLUSH	BIT(1)
 	unsigned long		flags;
 };
 
@@ -214,7 +216,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 			return ret;
 	} while (pgd++, addr = next, addr != end);
 
-	if (walk->remap_pte)
+	if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
 		flush_tlb_kernel_range(start, end);
 
 	return 0;
@@ -355,19 +357,21 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  * @reuse:	reuse address.
  * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
  *		responsibility to free pages.
+ * @flags:	modifications to vmemmap_remap_walk flags
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
 			      unsigned long reuse,
-			      struct list_head *vmemmap_pages)
+			      struct list_head *vmemmap_pages,
+			      unsigned long flags)
 {
 	int ret;
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_remap_pte,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= vmemmap_pages,
-		.flags		= 0,
+		.flags		= flags,
 	};
 	int nid = page_to_nid((struct page *)reuse);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -629,7 +633,8 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
 
 static int __hugetlb_vmemmap_optimize(const struct hstate *h,
 					struct page *head,
-					struct list_head *vmemmap_pages)
+					struct list_head *vmemmap_pages,
+					unsigned long flags)
 {
 	int ret = 0;
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -640,6 +645,18 @@ static int __hugetlb_vmemmap_optimize(const struct hstate *h,
 		return ret;
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
+	/*
+	 * Very Subtle
+	 * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
+	 * immediately after remapping.  As a result, subsequent accesses
+	 * and modifications to struct pages associated with the hugetlb
+	 * page could be to the OLD struct pages.  Set the vmemmap optimized
+	 * flag here so that it is copied to the new head page.  This keeps
+	 * the old and new struct pages in sync.
+	 * If there is an error during optimization, we will immediately FLUSH
+	 * the TLB and clear the flag below.
+	 */
+	SetHPageVmemmapOptimized(head);
 
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 	vmemmap_reuse	= vmemmap_start;
@@ -651,11 +668,12 @@ static int __hugetlb_vmemmap_optimize(const struct hstate *h,
 	 * mapping the range to vmemmap_pages list so that they can be freed by
 	 * the caller.
 	 */
-	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages);
-	if (ret)
+	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
+							vmemmap_pages, flags);
+	if (ret) {
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
-	else
-		SetHPageVmemmapOptimized(head);
+		ClearHPageVmemmapOptimized(head);
+	}
 
 	return ret;
 }
@@ -674,7 +692,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 {
 	LIST_HEAD(vmemmap_pages);
 
-	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
+	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages, 0);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
@@ -719,19 +737,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
-								&vmemmap_pages);
+						&vmemmap_pages,
+						VMEMMAP_REMAP_NO_TLB_FLUSH);
 
 		/*
 		 * Pages to be freed may have been accumulated.  If we
 		 * encounter an ENOMEM,  free what we have and try again.
+		 * This can occur in the case that both splitting fails
+		 * halfway and head page allocation also fails. In this
+		 * case __hugetlb_vmemmap_optimize() would free memory
+		 * allowing more vmemmap remaps to occur.
 		 */
 		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
+			flush_tlb_all();
 			free_vmemmap_page_list(&vmemmap_pages);
 			INIT_LIST_HEAD(&vmemmap_pages);
-			__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
+			__hugetlb_vmemmap_optimize(h, &folio->page,
+						&vmemmap_pages,
+						VMEMMAP_REMAP_NO_TLB_FLUSH);
 		}
 	}
 
+	flush_tlb_all();
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
-- 
2.41.0



* [PATCH v6 8/8] hugetlb: batch TLB flushes when restoring vmemmap
  2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (6 preceding siblings ...)
  2023-09-25 23:48 ` [PATCH v6 7/8] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
@ 2023-09-25 23:48 ` Mike Kravetz
  2023-09-26  2:20   ` Muchun Song
  7 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-09-25 23:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Update the internal hugetlb restore vmemmap code path such that TLB
flushing can be batched.  Use the existing mechanism of passing the
VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
performed for individual pages.  The routine hugetlb_vmemmap_restore_folios
is the only user of this new mechanism, and it will perform a global
flush after all vmemmap is restored.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 9df350372046..d2999c303031 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -461,18 +461,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
  * @reuse:	reuse address.
+ * @flags:	modifications to vmemmap_remap_walk flags
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			       unsigned long reuse)
+			       unsigned long reuse, unsigned long flags)
 {
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_restore_pte,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
-		.flags		= 0,
+		.flags		= flags,
 	};
 
 	/* See the comment in the vmemmap_remap_free(). */
@@ -494,17 +495,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
 static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
 core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
 
-/**
- * hugetlb_vmemmap_restore - restore previously optimized (by
- *			     hugetlb_vmemmap_optimize()) vmemmap pages which
- *			     will be reallocated and remapped.
- * @h:		struct hstate.
- * @head:	the head page whose vmemmap pages will be restored.
- *
- * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
- * negative error code otherwise.
- */
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
 {
 	int ret;
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -525,7 +516,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
 	if (!ret) {
 		ClearHPageVmemmapOptimized(head);
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -534,6 +525,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	return ret;
 }
 
+/**
+ * hugetlb_vmemmap_restore - restore previously optimized (by
+ *				hugetlb_vmemmap_optimize()) vmemmap pages which
+ *				will be reallocated and remapped.
+ * @h:		struct hstate.
+ * @head:	the head page whose vmemmap pages will be restored.
+ *
+ * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
+ * negative error code otherwise.
+ */
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+{
+	return __hugetlb_vmemmap_restore(h, head, 0);
+}
+
 /**
  * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
  * @h:			hstate.
@@ -557,7 +563,8 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
-			ret = hugetlb_vmemmap_restore(h, &folio->page);
+			ret = __hugetlb_vmemmap_restore(h, &folio->page,
+						VMEMMAP_REMAP_NO_TLB_FLUSH);
 			if (ret)
 				break;
 			restored++;
@@ -567,6 +574,8 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 		list_move(&folio->lru, non_hvo_folios);
 	}
 
+	if (restored)
+		flush_tlb_all();
 	if (!ret)
 		ret = restored;
 	return ret;
-- 
2.41.0



* Re: [PATCH v6 8/8] hugetlb: batch TLB flushes when restoring vmemmap
  2023-09-25 23:48 ` [PATCH v6 8/8] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
@ 2023-09-26  2:20   ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2023-09-26  2:20 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton



> On Sep 26, 2023, at 07:48, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> Update the internal hugetlb restore vmemmap code path such that TLB
> flushing can be batched.  Use the existing mechanism of passing the
> VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
> performed for individual pages.  The routine hugetlb_vmemmap_restore_folios
> is the only user of this new mechanism, and it will perform a global
> flush after all vmemmap is restored.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.
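
A minimal userspace sketch of the batching idea described in the quoted
commit message, assuming a mocked flush routine.  Only the flag name
VMEMMAP_REMAP_NO_TLB_FLUSH comes from the patch; its value here, the
mock_tlb_flush() counter and the restore_one()/restore_many() helpers are
hypothetical stand-ins, not the kernel implementation.

```c
#include <assert.h>

/* Flag name taken from the patch; the value is a placeholder for this sketch. */
#define VMEMMAP_REMAP_NO_TLB_FLUSH	0x1UL

/* Counts how many (mock) TLB flushes were issued. */
static unsigned long nr_flushes;

/* Hypothetical stand-in for a real TLB flush. */
static void mock_tlb_flush(void)
{
	nr_flushes++;
}

/* Restore one page's vmemmap; flush unless the caller asked to defer it. */
static int restore_one(unsigned long flags)
{
	/* The remap work would happen here. */
	if (!(flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
		mock_tlb_flush();
	return 0;
}

/* Restore a batch: defer every per-page flush, then flush once at the end. */
static long restore_many(unsigned long nr_pages)
{
	long restored = 0;
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		if (restore_one(VMEMMAP_REMAP_NO_TLB_FLUSH))
			break;
		restored++;
	}

	if (restored)
		mock_tlb_flush();
	return restored;
}
```

In this model, restoring N pages one at a time with flags == 0 issues N
flushes, while restore_many(N) issues a single flush, and none at all
when nothing was restored.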



* Re: [PATCH v6 4/8] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-25 23:48 ` [PATCH v6 4/8] hugetlb: perform vmemmap restoration " Mike Kravetz
@ 2023-09-26  2:27   ` Muchun Song
  2023-09-29 22:10   ` Mike Kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Muchun Song @ 2023-09-26  2:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton



> On Sep 26, 2023, at 07:48, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> The routine update_and_free_pages_bulk already performs vmemmap
> restoration on the list of hugetlb pages in a separate step.  In
> preparation for more functionality to be added in this step, create a
> new routine hugetlb_vmemmap_restore_folios() that will restore
> vmemmap for a list of folios.
> 
> This new routine must provide sufficient feedback about errors and
> actual restoration performed so that update_and_free_pages_bulk can
> perform optimally.
> 
> Special care must be taken when encountering an error from
> hugetlb_vmemmap_restore_folios.  We want to continue making as much
> forward progress as possible.  A new routine bulk_vmemmap_restore_error
> handles this specific situation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks for your continued work on this.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>




* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-09-25 23:48 ` [PATCH v6 2/8] hugetlb: restructure pool allocations Mike Kravetz
@ 2023-09-27 11:26   ` Konrad Dybcio
  2023-09-29 20:57     ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2023-09-27 11:26 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Andrew Morton,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel



On 26.09.2023 01:48, Mike Kravetz wrote:
> Allocation of a hugetlb page for the hugetlb pool is done by the routine
> alloc_pool_huge_page.  This routine will allocate contiguous pages from
> a low level allocator, prep the pages for usage as a hugetlb page and
> then add the resulting hugetlb page to the pool.
> 
> In the 'prep' stage, optional vmemmap optimization is done.  For
> performance reasons we want to perform vmemmap optimization on multiple
> hugetlb pages at once.  To do this, restructure the hugetlb pool
> allocation code such that vmemmap optimization can be isolated and later
> batched.
> 
> The code to allocate hugetlb pages from bootmem was also modified to
> allow batching.
> 
> No functional changes, only code restructure.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> ---
Hi, looks like this patch prevents today's next from booting
on at least one Qualcomm ARM64 platform. Reverting it makes
the device boot again.

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-09-27 11:26   ` Konrad Dybcio
@ 2023-09-29 20:57     ` Mike Kravetz
  2023-10-02  9:57       ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-09-29 20:57 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Andrew Morton,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel

On 09/27/23 13:26, Konrad Dybcio wrote:
> 
> 
> On 26.09.2023 01:48, Mike Kravetz wrote:
> > Allocation of a hugetlb page for the hugetlb pool is done by the routine
> > alloc_pool_huge_page.  This routine will allocate contiguous pages from
> > a low level allocator, prep the pages for usage as a hugetlb page and
> > then add the resulting hugetlb page to the pool.
> > 
> > In the 'prep' stage, optional vmemmap optimization is done.  For
> > performance reasons we want to perform vmemmap optimization on multiple
> > hugetlb pages at once.  To do this, restructure the hugetlb pool
> > allocation code such that vmemmap optimization can be isolated and later
> > batched.
> > 
> > The code to allocate hugetlb pages from bootmem was also modified to
> > allow batching.
> > 
> > No functional changes, only code restructure.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> Hi, looks like this patch prevents today's next from booting
> on at least one Qualcomm ARM64 platform. Reverting it makes
> the device boot again.

Can you share the config used and any other specific information, such as
the kernel command line?

I cannot reproduce this on the arm64 platforms I have.  I have been trying
various config combinations without success, although there are lots of
possibilities.  I am also taking a closer look at the changes.  So far,
nothing is obvious.
-- 
Mike Kravetz


* Re: [PATCH v6 4/8] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-25 23:48 ` [PATCH v6 4/8] hugetlb: perform vmemmap restoration " Mike Kravetz
  2023-09-26  2:27   ` Muchun Song
@ 2023-09-29 22:10   ` Mike Kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-09-29 22:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton

On 09/25/23 16:48, Mike Kravetz wrote:
<snip>
> +static void update_and_free_pages_bulk(struct hstate *h,
> +						struct list_head *folio_list)
> +{
> +	long ret;
> +	struct folio *folio, *t_folio;
> +	LIST_HEAD(non_hvo_folios);
>  
>  	/*
> -	 * If vmemmmap allocation was performed on any folio above, take lock
> -	 * to clear destructor of all folios on list.  This avoids the need to
> -	 * lock/unlock for each individual folio.
> -	 * The assumption is vmemmap allocation was performed on all or none
> -	 * of the folios on the list.  This is true expect in VERY rare cases.
> +	 * First allocate required vmemmmap (if necessary) for all folios.
> +	 * Carefully handle errors and free up any available hugetlb pages
> +	 * in an effort to make forward progress.
>  	 */
> -	if (clear_dtor) {
> +retry:
> +	ret = hugetlb_vmemmap_restore_folios(h, folio_list, &non_hvo_folios);
> +	if (ret < 0) {
> +		bulk_vmemmap_restore_error(h, folio_list, &non_hvo_folios);
> +		goto retry;
> +	}
> +
> +	/*
> +	 * At this point, list should be empty, ret should be >= 0 and there
> +	 * should only be pages on the non_hvo_folios list.
> +	 * Do note that the non_hvo_folios list could be empty.
> +	 * Without HVO enabled, ret will be 0 and there is no need to call
> +	 * __clear_hugetlb_destructor as this was done previously.
> +	 */
> +	VM_WARN_ON(!list_empty(folio_list));
> +	VM_WARN_ON(ret < 0);
> +	if (!list_empty(&non_hvo_folios) && ret) {
>  		spin_lock_irq(&hugetlb_lock);
> -		list_for_each_entry(folio, list, lru)
> +		list_for_each_entry(folio, &non_hvo_folios, lru)
>  			__clear_hugetlb_destructor(h, folio);
>  		spin_unlock_irq(&hugetlb_lock);
>  	}
>  
> -	/*
> -	 * Free folios back to low level allocators.  vmemmap and destructors
> -	 * were taken care of above, so update_and_free_hugetlb_folio will
> -	 * not need to take hugetlb lock.
> -	 */
> -	list_for_each_entry_safe(folio, t_folio, list, lru) {
> +	list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
>  		update_and_free_hugetlb_folio(h, folio, false);
>  		cond_resched();
>  	}
<snip>
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index c512e388dbb4..0b7710f90e38 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -19,6 +19,9 @@
>  
>  #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +long hugetlb_vmemmap_restore_folios(const struct hstate *h,
> +					struct list_head *folio_list,
> +					struct list_head *non_hvo_folios);
>  void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
>  void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>  
> @@ -45,6 +48,13 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
>  	return 0;
>  }
>  
> +static long hugetlb_vmemmap_restore_folios(const struct hstate *h,
> +					struct list_head *folio_list,
> +					struct list_head *non_hvo_folios)
> +{
> +	return 0;
> +}

update_and_free_pages_bulk depends on pages with complete vmemmap being
moved from folio_list to non_hvo_folios.  In the case where we return 0,
it expects ALL pages to be moved.  Therefore, in the case where
!CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP the stub above must perform

	list_splice_init(folio_list, non_hvo_folios);

before returning 0.
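
A minimal userspace sketch of the stub behaviour described above,
assuming a hand-rolled stand-in for the kernel's list helpers.  The
list_head code below only mimics <linux/list.h>, and list_count() and
check_stub_contract() are hypothetical helpers added for illustration;
this is not the patch that will be sent.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Move every entry of @list to the front of @head and leave @list empty. */
static void list_splice_init(struct list_head *list, struct list_head *head)
{
	if (list_empty(list))
		return;
	list->next->prev = head;
	list->prev->next = head->next;
	head->next->prev = list->prev;
	head->next = list->next;
	INIT_LIST_HEAD(list);
}

/* Hypothetical helper: number of entries on a list. */
static size_t list_count(const struct list_head *head)
{
	const struct list_head *pos;
	size_t n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

struct hstate;	/* opaque in this sketch */

/*
 * Stub for the !CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP case: nothing needs
 * restoring, so every folio is moved to non_hvo_folios before returning 0.
 */
static long hugetlb_vmemmap_restore_folios(const struct hstate *h,
					struct list_head *folio_list,
					struct list_head *non_hvo_folios)
{
	(void)h;
	list_splice_init(folio_list, non_hvo_folios);
	return 0;
}

/* The caller's expectation: a return of 0 leaves folio_list empty. */
static void check_stub_contract(void)
{
	struct list_head folio_list, non_hvo_folios, a, b;

	INIT_LIST_HEAD(&folio_list);
	INIT_LIST_HEAD(&non_hvo_folios);
	list_add_tail(&a, &folio_list);
	list_add_tail(&b, &folio_list);

	assert(hugetlb_vmemmap_restore_folios(NULL, &folio_list,
					      &non_hvo_folios) == 0);
	assert(list_empty(&folio_list));
	assert(list_count(&non_hvo_folios) == 2);
}
```

Without the splice, the stub would return 0 while leaving entries on
folio_list, which is the state the VM_WARN_ON(!list_empty(folio_list))
check in update_and_free_pages_bulk is there to catch.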

I will update and send a new version along with any changes needed to
address the arm64 boot issue reported with patch 2.
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-09-29 20:57     ` Mike Kravetz
@ 2023-10-02  9:57       ` Konrad Dybcio
  2023-10-06  3:08         ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-02  9:57 UTC (permalink / raw)
  To: Mike Kravetz, Konrad Dybcio
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Andrew Morton,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel



On 9/29/23 22:57, Mike Kravetz wrote:
> On 09/27/23 13:26, Konrad Dybcio wrote:
>>
>>
>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine
>>> alloc_pool_huge_page.  This routine will allocate contiguous pages from
>>> a low level allocator, prep the pages for usage as a hugetlb page and
>>> then add the resulting hugetlb page to the pool.
>>>
>>> In the 'prep' stage, optional vmemmap optimization is done.  For
>>> performance reasons we want to perform vmemmap optimization on multiple
>>> hugetlb pages at once.  To do this, restructure the hugetlb pool
>>> allocation code such that vmemmap optimization can be isolated and later
>>> batched.
>>>
>>> The code to allocate hugetlb pages from bootmem was also modified to
>>> allow batching.
>>>
>>> No functional changes, only code restructure.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>>> ---
>> Hi, looks like this patch prevents today's next from booting
>> on at least one Qualcomm ARM64 platform. Reverting it makes
>> the device boot again.
> 
> Can you share the config used and any other specific information such as
> kernel command line.
Later this week.

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-02  9:57       ` Konrad Dybcio
@ 2023-10-06  3:08         ` Mike Kravetz
  2023-10-06 21:39           ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-10-06  3:08 UTC (permalink / raw)
  To: Konrad Dybcio, Andrew Morton
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel

On 10/02/23 11:57, Konrad Dybcio wrote:
> 
> 
> On 9/29/23 22:57, Mike Kravetz wrote:
> > On 09/27/23 13:26, Konrad Dybcio wrote:
> > > 
> > > 
> > > On 26.09.2023 01:48, Mike Kravetz wrote:
> > > > Allocation of a hugetlb page for the hugetlb pool is done by the routine
> > > > alloc_pool_huge_page.  This routine will allocate contiguous pages from
> > > > a low level allocator, prep the pages for usage as a hugetlb page and
> > > > then add the resulting hugetlb page to the pool.
> > > > 
> > > > In the 'prep' stage, optional vmemmap optimization is done.  For
> > > > performance reasons we want to perform vmemmap optimization on multiple
> > > > hugetlb pages at once.  To do this, restructure the hugetlb pool
> > > > allocation code such that vmemmap optimization can be isolated and later
> > > > batched.
> > > > 
> > > > The code to allocate hugetlb pages from bootmem was also modified to
> > > > allow batching.
> > > > 
> > > > No functional changes, only code restructure.
> > > > 
> > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > > Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> > > > ---
> > > Hi, looks like this patch prevents today's next from booting
> > > on at least one Qualcomm ARM64 platform. Reverting it makes
> > > the device boot again.
> > 
> > Can you share the config used and any other specific information such as
> > kernel command line.
> Later this week.

As mentioned, I have been unable to reproduce on arm64 platforms I can
access.  I have tried various config and boot options.  While doing so,
I came across one issue impacting kernels compiled without
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
that would prevent booting.

I will send out an updated version of the series in the hope that any
other issues may be discovered.
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-06  3:08         ` Mike Kravetz
@ 2023-10-06 21:39           ` Konrad Dybcio
  2023-10-06 22:35             ` Mike Kravetz
  2023-10-07  1:51             ` Jane Chu
  0 siblings, 2 replies; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-06 21:39 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel

On 6.10.2023 05:08, Mike Kravetz wrote:
> On 10/02/23 11:57, Konrad Dybcio wrote:
>>
>>
>> On 9/29/23 22:57, Mike Kravetz wrote:
>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>
>>>>
>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine
>>>>> alloc_pool_huge_page.  This routine will allocate contiguous pages from
>>>>> a low level allocator, prep the pages for usage as a hugetlb page and
>>>>> then add the resulting hugetlb page to the pool.
>>>>>
>>>>> In the 'prep' stage, optional vmemmap optimization is done.  For
>>>>> performance reasons we want to perform vmemmap optimization on multiple
>>>>> hugetlb pages at once.  To do this, restructure the hugetlb pool
>>>>> allocation code such that vmemmap optimization can be isolated and later
>>>>> batched.
>>>>>
>>>>> The code to allocate hugetlb pages from bootmem was also modified to
>>>>> allow batching.
>>>>>
>>>>> No functional changes, only code restructure.
>>>>>
>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>>>>> ---
>>>> Hi, looks like this patch prevents today's next from booting
>>>> on at least one Qualcomm ARM64 platform. Reverting it makes
>>>> the device boot again.
>>>
>>> Can you share the config used and any other specific information such as
>>> kernel command line.
>> Later this week.
> 
> As mentioned, I have been unable to reproduce on arm64 platforms I can
> access.  I have tried various config and boot options.  While doing so,
> I came across one issue impacting kernels compiled without
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
> that would prevent booting.
> 
> I will send out an updated version series in the hope that any other
> issues may be discovered.
I'm pushing the "later this week" by answering near end of calendar
day, Friday, but it seems like this patch in v7 still prevents the
device from booting..

You can find my defconfig at the link below.

https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-06 21:39           ` Konrad Dybcio
@ 2023-10-06 22:35             ` Mike Kravetz
  2023-10-09  3:29               ` Mike Kravetz
  2023-10-07  1:51             ` Jane Chu
  1 sibling, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-10-06 22:35 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Andrew Morton, Anshuman Khandual, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel

On 10/06/23 23:39, Konrad Dybcio wrote:
> On 6.10.2023 05:08, Mike Kravetz wrote:
> > On 10/02/23 11:57, Konrad Dybcio wrote:
> >>
> >>
> >> On 9/29/23 22:57, Mike Kravetz wrote:
> >>> On 09/27/23 13:26, Konrad Dybcio wrote:
> >>>>
> >>>>
> >>>> On 26.09.2023 01:48, Mike Kravetz wrote:
> >>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine
> >>>>> alloc_pool_huge_page.  This routine will allocate contiguous pages from
> >>>>> a low level allocator, prep the pages for usage as a hugetlb page and
> >>>>> then add the resulting hugetlb page to the pool.
> >>>>>
> >>>>> In the 'prep' stage, optional vmemmap optimization is done.  For
> >>>>> performance reasons we want to perform vmemmap optimization on multiple
> >>>>> hugetlb pages at once.  To do this, restructure the hugetlb pool
> >>>>> allocation code such that vmemmap optimization can be isolated and later
> >>>>> batched.
> >>>>>
> >>>>> The code to allocate hugetlb pages from bootmem was also modified to
> >>>>> allow batching.
> >>>>>
> >>>>> No functional changes, only code restructure.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> >>>>> ---
> >>>> Hi, looks like this patch prevents today's next from booting
> >>>> on at least one Qualcomm ARM64 platform. Reverting it makes
> >>>> the device boot again.
> >>>
> >>> Can you share the config used and any other specific information such as
> >>> kernel command line.
> >> Later this week.
> > 
> > As mentioned, I have been unable to reproduce on arm64 platforms I can
> > access.  I have tried various config and boot options.  While doing so,
> > I came across one issue impacting kernels compiled without
> > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
> > that would prevent booting.
> > 
> > I will send out an updated version series in the hope that any other
> > issues may be discovered.
> I'm pushing the "later this week" by answering near end of calendar
> day, Friday, but it seems like this patch in v7 still prevents the
> device from booting..
> 
> You can find my defconfig at the link below.
> 
> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
> 

Thanks!

I assume there is no further information such as any console output?
Did any of your other arm64 platforms have this issue?

Just trying to get as much information as possible to get to root cause.
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-06 21:39           ` Konrad Dybcio
  2023-10-06 22:35             ` Mike Kravetz
@ 2023-10-07  1:51             ` Jane Chu
  2023-10-09 10:13               ` Konrad Dybcio
  1 sibling, 1 reply; 31+ messages in thread
From: Jane Chu @ 2023-10-07  1:51 UTC (permalink / raw)
  To: Konrad Dybcio, Mike Kravetz, Andrew Morton
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel

Hi, Konrad,

Just wondering, is your arm64 system a VM instance, or a bare metal?

thanks!
-jane


On 10/6/2023 2:39 PM, Konrad Dybcio wrote:
> On 6.10.2023 05:08, Mike Kravetz wrote:
>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>
>>>
>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>
[..]
>> I will send out an updated version series in the hope that any other
>> issues may be discovered.
> I'm pushing the "later this week" by answering near end of calendar
> day, Friday, but it seems like this patch in v7 still prevents the
> device from booting..
> 
> You can find my defconfig at the link below.
> 
> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
> 
> Konrad
> 


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-06 22:35             ` Mike Kravetz
@ 2023-10-09  3:29               ` Mike Kravetz
  2023-10-09 10:11                 ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-10-09  3:29 UTC (permalink / raw)
  To: Konrad Dybcio, Anshuman Khandual
  Cc: Andrew Morton, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel

On 10/06/23 15:35, Mike Kravetz wrote:
> On 10/06/23 23:39, Konrad Dybcio wrote:
> > On 6.10.2023 05:08, Mike Kravetz wrote:
> > > On 10/02/23 11:57, Konrad Dybcio wrote:
> > >>
> > >>
> > >> On 9/29/23 22:57, Mike Kravetz wrote:
> > >>> On 09/27/23 13:26, Konrad Dybcio wrote:
> > >>>>
> > >>>>
> > >>>> On 26.09.2023 01:48, Mike Kravetz wrote:
> > >>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine
> > >>>>> alloc_pool_huge_page.  This routine will allocate contiguous pages from
> > >>>>> a low level allocator, prep the pages for usage as a hugetlb page and
> > >>>>> then add the resulting hugetlb page to the pool.
> > >>>>>
> > >>>>> In the 'prep' stage, optional vmemmap optimization is done.  For
> > >>>>> performance reasons we want to perform vmemmap optimization on multiple
> > >>>>> hugetlb pages at once.  To do this, restructure the hugetlb pool
> > >>>>> allocation code such that vmemmap optimization can be isolated and later
> > >>>>> batched.
> > >>>>>
> > >>>>> The code to allocate hugetlb pages from bootmem was also modified to
> > >>>>> allow batching.
> > >>>>>
> > >>>>> No functional changes, only code restructure.
> > >>>>>
> > >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > >>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> > >>>>> ---
> > >>>> Hi, looks like this patch prevents today's next from booting
> > >>>> on at least one Qualcomm ARM64 platform. Reverting it makes
> > >>>> the device boot again.
> > >>>
> > >>> Can you share the config used and any other specific information such as
> > >>> kernel command line.
> > >> Later this week.
> > > 
> > > As mentioned, I have been unable to reproduce on arm64 platforms I can
> > > access.  I have tried various config and boot options.  While doing so,
> > > I came across one issue impacting kernels compiled without
> > > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
> > > that would prevent booting.
> > > 
> > > I will send out an updated version series in the hope that any other
> > > issues may be discovered.
> > I'm pushing the "later this week" by answering near end of calendar
> > day, Friday, but it seems like this patch in v7 still prevents the
> > device from booting..
> > 
> > You can find my defconfig at the link below.
> > 
> > https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
> > 
> 
> Thanks!
> 
> I assume there is no further information such as any console output?
> Did any of your other arm64 platforms have this issue?
> 
> Just trying to get as much information as possible to get to root cause.

I have not had success isolating the issue with your config file.

Since the only code changes in this patch deal with allocating hugetlb
pages, I assume this is what you are doing?  Can you let me know how you
are performing the allocations?  I assume it is on the kernel command
line as these would be processed earliest in boot.

If you are not allocating hugetlb pages, then I need to think of what
else may be happening.

Anshuman, any chance you (or someone else with access to arm64 platforms)
could throw this on any platforms you have access to for a quick test?
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09  3:29               ` Mike Kravetz
@ 2023-10-09 10:11                 ` Konrad Dybcio
  2023-10-09 15:04                   ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-09 10:11 UTC (permalink / raw)
  To: Mike Kravetz, Anshuman Khandual
  Cc: Andrew Morton, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel

On 9.10.2023 05:29, Mike Kravetz wrote:
> On 10/06/23 15:35, Mike Kravetz wrote:
>> On 10/06/23 23:39, Konrad Dybcio wrote:
>>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>>
>>>>>
>>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>>>>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine
>>>>>>>> alloc_pool_huge_page.  This routine will allocate contiguous pages from
>>>>>>>> a low level allocator, prep the pages for usage as a hugetlb page and
>>>>>>>> then add the resulting hugetlb page to the pool.
>>>>>>>>
>>>>>>>> In the 'prep' stage, optional vmemmap optimization is done.  For
>>>>>>>> performance reasons we want to perform vmemmap optimization on multiple
>>>>>>>> hugetlb pages at once.  To do this, restructure the hugetlb pool
>>>>>>>> allocation code such that vmemmap optimization can be isolated and later
>>>>>>>> batched.
>>>>>>>>
>>>>>>>> The code to allocate hugetlb pages from bootmem was also modified to
>>>>>>>> allow batching.
>>>>>>>>
>>>>>>>> No functional changes, only code restructure.
>>>>>>>>
>>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>>>>>>>> ---
>>>>>>> Hi, looks like this patch prevents today's next from booting
>>>>>>> on at least one Qualcomm ARM64 platform. Reverting it makes
>>>>>>> the device boot again.
>>>>>>
>>>>>> Can you share the config used and any other specific information such as
>>>>>> kernel command line.
>>>>> Later this week.
>>>>
>>>> As mentioned, I have been unable to reproduce on arm64 platforms I can
>>>> access.  I have tried various config and boot options.  While doing so,
>>>> I came across one issue impacting kernels compiled without
>>>> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
>>>> that would prevent booting.
>>>>
>>>> I will send out an updated version series in the hope that any other
>>>> issues may be discovered.
>>> I'm pushing the "later this week" by answering near end of calendar
>>> day, Friday, but it seems like this patch in v7 still prevents the
>>> device from booting..
>>>
>>> You can find my defconfig at the link below.
>>>
>>> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
>>>
>>
>> Thanks!
>>
>> I assume there is no further information such as any console output?
>> Did any of your other arm64 platforms have this issue?
>>
>> Just trying to get as much information as possible to get to root cause.
> 
> I have not had success isolating the issue with your config file.
> 
> Since the only code changes in this patch deal with allocating hugetlb
> pages, I assume this is what you are doing?  Can you let me know how you
> are performing the allocations?  I assume it is on the kernel command
> line as these would be processed earliest in boot.
> 
> If you are not allocating hugetlb pages, then I need to think of what
> else may be happening.
> 
> Anshuman, any chance you (or someone else with access to arm64 platforms)
> could throw this on any platforms you have access to for a quick test?
I managed to get a boot log:

https://pastebin.com/GwurpCw9

This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-07  1:51             ` Jane Chu
@ 2023-10-09 10:13               ` Konrad Dybcio
  0 siblings, 0 replies; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-09 10:13 UTC (permalink / raw)
  To: Jane Chu, Mike Kravetz, Andrew Morton
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel

On 7.10.2023 03:51, Jane Chu wrote:
> Hi, Konrad,
> 
> Just wondering, is your arm64 system a VM instance, or a bare metal?
That's a tricky question :)

Qualcomm platforms expose much of the hardware in a manner that's
similar to a VM, there's an extensive irreplaceable hypervisor in
place and the user can only boot Linux at EL1..

So, I guess the answer here is "somewhat bare metal" :/

Konrad
> 
> thanks!
> -jane
> 
> 
> On 10/6/2023 2:39 PM, Konrad Dybcio wrote:
>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>
>>>>
>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>
> [..]
>>> I will send out an updated version series in the hope that any other
>>> issues may be discovered.
>> I'm pushing the "later this week" by answering near end of calendar
>> day, Friday, but it seems like this patch in v7 still prevents the
>> device from booting..
>>
>> You can find my defconfig at the link below.
>>
>> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
>>
>> Konrad
>>


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09 10:11                 ` Konrad Dybcio
@ 2023-10-09 15:04                   ` Mike Kravetz
  2023-10-09 15:15                     ` Mike Kravetz
  2023-10-09 21:09                     ` Konrad Dybcio
  0 siblings, 2 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-10-09 15:04 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Anshuman Khandual, Andrew Morton, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel

On 10/09/23 12:11, Konrad Dybcio wrote:
> On 9.10.2023 05:29, Mike Kravetz wrote:
> > On 10/06/23 15:35, Mike Kravetz wrote:
> >> On 10/06/23 23:39, Konrad Dybcio wrote:
> >>> On 6.10.2023 05:08, Mike Kravetz wrote:
> >>>> On 10/02/23 11:57, Konrad Dybcio wrote:
> >>>>> On 9/29/23 22:57, Mike Kravetz wrote:
> >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
> >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>
> I managed to get a boot log:
> 
> https://pastebin.com/GwurpCw9
> 
> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
> 

Early on in boot log before the panic, I see this in the log:

[    0.000000] efi: UEFI not found.
[    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!

Isn't that misalignment pretty serious?  Or, is it possible to run with that?

There are no hugetlb pages allocated at boot time:

[    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm 

So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
noop.  The first thing it does is:
list_for_each_entry(m, &huge_boot_pages, list) {
...
}

However, huge_boot_pages should be empty as initialized here:
__initdata LIST_HEAD(huge_boot_pages);

At the end of the routine, we call prep_and_add_bootmem_folios to
process the local list created within that above loop:

LIST_HEAD(folio_list);

This should also be empty and a noop.

Is it possible that the misaligned kernel image could make these lists
appear as non-empty?
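
To make the reasoning above concrete, here is a minimal userspace sketch of
why walking an empty list should be a noop.  It is not the kernel code: the
list helpers are simplified re-implementations in the spirit of
include/linux/list.h, and struct boot_page, list_add_tail_sketch() and
count_boot_pages() are hypothetical names used only for illustration.  The
walk starts at head->next and stops once the cursor is back at the head, so
a head that points at itself yields zero iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head whose next and prev point back at itself. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for the entries kept on huge_boot_pages. */
struct boot_page {
	struct list_head list;
	int id;
};

/* Link 'entry' in just before 'head', i.e. at the tail of the list. */
static void list_add_tail_sketch(struct list_head *entry,
				 struct list_head *head)
{
	struct list_head *last = head->prev;

	entry->next = head;
	entry->prev = last;
	last->next = entry;
	head->prev = entry;
}

/*
 * Walk the list with the same termination rule as list_for_each_entry():
 * begin at head->next and stop when the cursor returns to the head.
 * Returns the number of entries visited.
 */
static int count_boot_pages(struct list_head *head)
{
	int visited = 0;

	assert(head != NULL);
	for (struct list_head *p = head->next; p != head; p = p->next) {
		struct boot_page *m = container_of(p, struct boot_page, list);

		(void)m->id;	/* entries are only touched inside the loop */
		visited++;
	}
	return visited;
}
```

With a list declared through LIST_HEAD(), count_boot_pages() returns 0
without touching any entry.  The loop body only runs if the head does not
point back at itself, for example because entries were added or because the
head's pointers were zeroed or overwritten; whether the latter can happen
here is exactly the open question.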
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09 15:04                   ` Mike Kravetz
@ 2023-10-09 15:15                     ` Mike Kravetz
  2023-10-09 21:09                       ` Konrad Dybcio
  2023-10-10  0:07                       ` Andrew Morton
  2023-10-09 21:09                     ` Konrad Dybcio
  1 sibling, 2 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-10-09 15:15 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Anshuman Khandual, Andrew Morton, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel

On 10/09/23 08:04, Mike Kravetz wrote:
> On 10/09/23 12:11, Konrad Dybcio wrote:
> > On 9.10.2023 05:29, Mike Kravetz wrote:
> > > On 10/06/23 15:35, Mike Kravetz wrote:
> > >> On 10/06/23 23:39, Konrad Dybcio wrote:
> > >>> On 6.10.2023 05:08, Mike Kravetz wrote:
> > >>>> On 10/02/23 11:57, Konrad Dybcio wrote:
> > >>>>> On 9/29/23 22:57, Mike Kravetz wrote:
> > >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
> > >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
> >
> > I managed to get a boot log:
> > 
> > https://pastebin.com/GwurpCw9
> > 
> > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
> > 
> 
> Early on in boot log before the panic, I see this in the log:
> 
> [    0.000000] efi: UEFI not found.
> [    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!
> 
> Isn't that misalignment pretty serious?  Or, is it possible to run with that?
> 
> There are no hugetlb pages allocated at boot time:
> 
> [    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm 
> 
> So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
> noop.  The first thing it does is:
> list_for_each_entry(m, &huge_boot_pages, list) {
> ...
> }
> 
> However, huge_boot_pages should be empty as initialized here:
> __initdata LIST_HEAD(huge_boot_pages);
> 
> At the end of the routine, we call prep_and_add_bootmem_folios to
> process the local list created within that above loop:
> 
> LIST_HEAD(folio_list);
> 
> This should also be empty and a noop.
> 
> Is it possible that the misaligned kernel image could make these lists
> appear as non-empty?

Actually, just saw this:

https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/

Will take a look, although as mentioned above prep_and_add_bootmem_folios on
an empty list should be a noop.
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09 15:04                   ` Mike Kravetz
  2023-10-09 15:15                     ` Mike Kravetz
@ 2023-10-09 21:09                     ` Konrad Dybcio
  1 sibling, 0 replies; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-09 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Anshuman Khandual, Andrew Morton, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel



On 10/9/23 17:04, Mike Kravetz wrote:
> On 10/09/23 12:11, Konrad Dybcio wrote:
>> On 9.10.2023 05:29, Mike Kravetz wrote:
>>> On 10/06/23 15:35, Mike Kravetz wrote:
>>>> On 10/06/23 23:39, Konrad Dybcio wrote:
>>>>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>>>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>
>> I managed to get a boot log:
>>
>> https://pastebin.com/GwurpCw9
>>
>> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
>>
> 
> Early on in boot log before the panic, I see this in the log:
> 
> [    0.000000] efi: UEFI not found.
> [    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!
> 
> Isn't that misalignment pretty serious?  Or, is it possible to run with that?
That has never caused any issues and sadly I can't do anything about it.

> 
> There are no hugetlb pages allocated at boot time:
> 
> [    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm
> 
> So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
> noop.  The first thing it does is:
> list_for_each_entry(m, &huge_boot_pages, list) {
> ...
> }
> 
> However, huge_boot_pages should be empty as initialized here:
> __initdata LIST_HEAD(huge_boot_pages);
> 
> At the end of the routine, we call prep_and_add_bootmem_folios to
> process the local list created within that above loop:
> 
> LIST_HEAD(folio_list);
> 
> This should also be empty and a noop.
> 
> Is it possible that the misaligned kernel image could make these lists
> appear as non-empty?
I don't think I have an answer for this

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09 15:15                     ` Mike Kravetz
@ 2023-10-09 21:09                       ` Konrad Dybcio
  2023-10-10  1:26                         ` Mike Kravetz
  2023-10-10  0:07                       ` Andrew Morton
  1 sibling, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-09 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Anshuman Khandual, Andrew Morton, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel



On 10/9/23 17:15, Mike Kravetz wrote:
> On 10/09/23 08:04, Mike Kravetz wrote:
>> On 10/09/23 12:11, Konrad Dybcio wrote:
>>> On 9.10.2023 05:29, Mike Kravetz wrote:
>>>> On 10/06/23 15:35, Mike Kravetz wrote:
>>>>> On 10/06/23 23:39, Konrad Dybcio wrote:
>>>>>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>>>>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>>
>>> I managed to get a boot log:
>>>
>>> https://pastebin.com/GwurpCw9
>>>
>>> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
>>>
>>
>> Early on in boot log before the panic, I see this in the log:
>>
>> [    0.000000] efi: UEFI not found.
>> [    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!
>>
>> Isn't that misalignment pretty serious?  Or, is it possible to run with that?
>>
>> There are no hugetlb pages allocated at boot time:
>>
>> [    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm
>>
>> So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
>> noop.  The first thing it does is:
>> list_for_each_entry(m, &huge_boot_pages, list) {
>> ...
>> }
>>
>> However, huge_boot_pages should be empty as initialized here:
>> __initdata LIST_HEAD(huge_boot_pages);
>>
>> At the end of the routine, we call prep_and_add_bootmem_folios to
>> process the local list created within that above loop:
>>
>> LIST_HEAD(folio_list);
>>
>> This should also be empty and a noop.
>>
>> Is it possible that the misaligned kernel image could make these lists
>> appear as non-empty?
> 
> Actually, just saw this:
> 
> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> 
> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> an empty list should be a noop.
I'll try it out atop the series tomorrow or so.

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09 15:15                     ` Mike Kravetz
  2023-10-09 21:09                       ` Konrad Dybcio
@ 2023-10-10  0:07                       ` Andrew Morton
  2023-10-10 21:30                         ` Konrad Dybcio
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2023-10-10  0:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Konrad Dybcio, Anshuman Khandual, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel, Usama Arif

On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > This should also be empty and a noop.
> > 
> > Is it possible that the misaligned kernel image could make these lists
> > appear as non-empty?
> 
> Actually, just saw this:
> 
> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> 
> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> an empty list should be a noop.

Konrad, are you able to test Usama's patch?  Thanks.

From: Usama Arif <usama.arif@bytedance.com>
Subject: mm: hugetlb: only prep and add allocated folios for non-gigantic pages
Date: Mon, 9 Oct 2023 15:56:05 +0100

Calling prep_and_add_allocated_folios when allocating gigantic pages at
boot time causes the kernel to crash as folio_list is empty and iterating
it causes a NULL pointer dereference.  Call this only for non-gigantic
pages when folio_list has entries.

Link: https://lkml.kernel.org/r/20231009145605.2150897-1-usama.arif@bytedance.com
Fixes: bfb41d6b2fe148 ("hugetlb: restructure pool allocations")
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb-restructure-pool-allocations-fix
+++ a/mm/hugetlb.c
@@ -3307,7 +3307,8 @@ static void __init hugetlb_hstate_alloc_
 	}
 
 	/* list will be empty if hstate_is_gigantic */
-	prep_and_add_allocated_folios(h, &folio_list);
+	if (!hstate_is_gigantic(h))
+		prep_and_add_allocated_folios(h, &folio_list);
 
 	if (i < h->max_huge_pages) {
 		char buf[32];
_



* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-09 21:09                       ` Konrad Dybcio
@ 2023-10-10  1:26                         ` Mike Kravetz
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2023-10-10  1:26 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Anshuman Khandual, Andrew Morton, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel

On 10/09/23 23:09, Konrad Dybcio wrote:
> 
> 
> On 10/9/23 17:15, Mike Kravetz wrote:
> > On 10/09/23 08:04, Mike Kravetz wrote:
> > > On 10/09/23 12:11, Konrad Dybcio wrote:
> > > > On 9.10.2023 05:29, Mike Kravetz wrote:
> > > > > On 10/06/23 15:35, Mike Kravetz wrote:
> > > > > > On 10/06/23 23:39, Konrad Dybcio wrote:
> > > > > > > On 6.10.2023 05:08, Mike Kravetz wrote:
> > > > > > > > On 10/02/23 11:57, Konrad Dybcio wrote:
> > > > > > > > > On 9/29/23 22:57, Mike Kravetz wrote:
> > > > > > > > > > On 09/27/23 13:26, Konrad Dybcio wrote:
> > > > > > > > > > > On 26.09.2023 01:48, Mike Kravetz wrote:
> > > > 
> > > > I managed to get a boot log:
> > > > 
> > > > https://pastebin.com/GwurpCw9
> > > > 
> > > > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
> > > > 
> > > 
> > > Early on in boot log before the panic, I see this in the log:
> > > 
> > > [    0.000000] efi: UEFI not found.
> > > [    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!
> > > 
> > > Isn't that misalignment pretty serious?  Or, is it possible to run with that?
> > > 
> > > There are no hugetlb pages allocated at boot time:
> > > 
> > > [    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm
> > > 
> > > So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
> > > noop.  The first thing it does is:
> > > list_for_each_entry(m, &huge_boot_pages, list) {
> > > ...
> > > }
> > > 
> > > However, huge_boot_pages should be empty as initialized here:
> > > __initdata LIST_HEAD(huge_boot_pages);
> > > 
> > > At the end of the routine, we call prep_and_add_bootmem_folios to
> > > process the local list created within that above loop:
> > > 
> > > LIST_HEAD(folio_list);
> > > 
> > > This should also be empty and a noop.
> > > 
> > > Is it possible that the misaligned kernel image could make these lists
> > > appear as non-empty?
> > 
> > Actually, just saw this:
> > 
> > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> > 
> > Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> > an empty list should be a noop.
> I'll try it out atop the series tomorrow or so.

I just replied to Usama's patch.  This may have more to do with IRQ enablement.

-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-10  0:07                       ` Andrew Morton
@ 2023-10-10 21:30                         ` Konrad Dybcio
  2023-10-10 21:45                           ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-10 21:30 UTC (permalink / raw)
  To: Andrew Morton, Mike Kravetz
  Cc: Anshuman Khandual, Xiongchun Duan, Barry Song, David Rientjes,
	Miaohe Lin, Matthew Wilcox, linux-mm, Naoya Horiguchi,
	Joao Martins, David Hildenbrand, Michal Hocko, Oscar Salvador,
	linux-kernel, Usama Arif



On 10/10/23 02:07, Andrew Morton wrote:
> On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>> This should also be empty and a noop.
>>>
>>> Is it possible that the misaligned kernel image could make these lists
>>> appear as non-empty?
>>
>> Actually, just saw this:
>>
>> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
>>
>> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
>> an empty list should be a noop.
> 
> Konrad, are you able to test Usama's patch?  Thanks.
I legitimately spent a sad amount of time trying to regain access to the 
remote board farm. Previously I could hit the bug on SM8550, but now I 
can't do it on SM8450, SM8350 and SM8250 (previous gens), with the same 
config.. I have no idea when I'll be able to get access to SM8550 again.

I did test it on the QCM6490 Fairphone 5 that I initially reported this 
on, and neither booting next-20231010 (with your patchset applied) nor 
adding the below patch on top of it seems to work. I can't get serial 
output from this device though to find out what it's unhappy about :/

Konrad


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-10 21:30                         ` Konrad Dybcio
@ 2023-10-10 21:45                           ` Mike Kravetz
  2023-10-11  9:36                             ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2023-10-10 21:45 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Andrew Morton, Anshuman Khandual, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel, Usama Arif

On 10/10/23 23:30, Konrad Dybcio wrote:
> 
> 
> On 10/10/23 02:07, Andrew Morton wrote:
> > On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> > > > This should also be empty and a noop.
> > > > 
> > > > Is it possible that the misaligned kernel image could make these lists
> > > > appear as non-empty?
> > > 
> > > Actually, just saw this:
> > > 
> > > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> > > 
> > > Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> > > an empty list should be a noop.
> > 
> > Konrad, are you able to test Usama's patch?  Thanks.
> I legitimately spent a sad amount of time trying to regain access to the
> remote board farm. Previously I could hit the bug on SM8550, but now I can't
> do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I
> have no idea when I'll be able to get access to SM8550 again.
> 
> I did test it on the QCM6490 Fairphone 5 that I initially reported this on,
> and neither booting next-20231010 (with your patchset applied) nor adding
> the below patch on top of it seems to work. I can't get serial output from
> this device though to find out what it's unhappy about :/

Sorry for causing you to spend so much time on this.

As mentioned in the reply to Usama's patch, the root cause seems to be
the locking.  So, the real change to test is the locking changes in
that thread; not Usama's patch.
-- 
Mike Kravetz


* Re: [PATCH v6 2/8] hugetlb: restructure pool allocations
  2023-10-10 21:45                           ` Mike Kravetz
@ 2023-10-11  9:36                             ` Konrad Dybcio
  0 siblings, 0 replies; 31+ messages in thread
From: Konrad Dybcio @ 2023-10-11  9:36 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Anshuman Khandual, Xiongchun Duan, Barry Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, linux-mm,
	Naoya Horiguchi, Joao Martins, David Hildenbrand, Michal Hocko,
	Oscar Salvador, linux-kernel, Usama Arif



On 10/10/23 23:45, Mike Kravetz wrote:
> On 10/10/23 23:30, Konrad Dybcio wrote:
>>
>>
>> On 10/10/23 02:07, Andrew Morton wrote:
>>> On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>>> This should also be empty and a noop.
>>>>>
>>>>> Is it possible that the misaligned kernel image could make these lists
>>>>> appear as non-empty?
>>>>
>>>> Actually, just saw this:
>>>>
>>>> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
>>>>
>>>> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
>>>> an empty list should be a noop.
>>>
>>> Konrad, are you able to test Usama's patch?  Thanks.
>> I legitimately spent a sad amount of time trying to regain access to the
>> remote board farm. Previously I could hit the bug on SM8550, but now I can't
>> do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I
>> have no idea when I'll be able to get access to SM8550 again.
>>
>> I did test it on the QCM6490 Fairphone 5 that I initially reported this on,
>> and neither booting next-20231010 (with your patchset applied) nor adding
>> the below patch on top of it seems to work. I can't get serial output from
>> this device though to find out what it's unhappy about :/
> 
> Sorry for causing you to spend so much time on this.
No worries, that was my explanation for why it took me so long to 
respond again..

> 
> As mentioned in the reply to Usama's patch, the root cause seems to be
> the locking.  So, the real change to test is the locking changes in
> that thread; not Usama's patch.
Ack

Konrad


end of thread, other threads:[~2023-10-11  9:36 UTC | newest]

Thread overview: 31+ messages
2023-09-25 23:48 [PATCH v6 0/8] Batch hugetlb vmemmap modification operations Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 1/8] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 2/8] hugetlb: restructure pool allocations Mike Kravetz
2023-09-27 11:26   ` Konrad Dybcio
2023-09-29 20:57     ` Mike Kravetz
2023-10-02  9:57       ` Konrad Dybcio
2023-10-06  3:08         ` Mike Kravetz
2023-10-06 21:39           ` Konrad Dybcio
2023-10-06 22:35             ` Mike Kravetz
2023-10-09  3:29               ` Mike Kravetz
2023-10-09 10:11                 ` Konrad Dybcio
2023-10-09 15:04                   ` Mike Kravetz
2023-10-09 15:15                     ` Mike Kravetz
2023-10-09 21:09                       ` Konrad Dybcio
2023-10-10  1:26                         ` Mike Kravetz
2023-10-10  0:07                       ` Andrew Morton
2023-10-10 21:30                         ` Konrad Dybcio
2023-10-10 21:45                           ` Mike Kravetz
2023-10-11  9:36                             ` Konrad Dybcio
2023-10-09 21:09                     ` Konrad Dybcio
2023-10-07  1:51             ` Jane Chu
2023-10-09 10:13               ` Konrad Dybcio
2023-09-25 23:48 ` [PATCH v6 3/8] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 4/8] hugetlb: perform vmemmap restoration " Mike Kravetz
2023-09-26  2:27   ` Muchun Song
2023-09-29 22:10   ` Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 5/8] hugetlb: batch freeing of vmemmap pages Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 6/8] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 7/8] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
2023-09-25 23:48 ` [PATCH v6 8/8] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
2023-09-26  2:20   ` Muchun Song
