* [PATCH v2 00/11] Batch hugetlb vmemmap modification operations
@ 2023-09-05 21:43 Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap Mike Kravetz
                   ` (10 more replies)
  0 siblings, 11 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1].  The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower.  Such overhead was deemed
an acceptable trade-off for the memory savings obtained by freeing vmemmap
pages.
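
For scale (an editor's back-of-envelope estimate, assuming x86_64 with 4K
base pages and a 64 byte struct page): each 2MB hugetlb page is described
by eight vmemmap pages, and the optimization keeps one of them while
freeing the other seven, roughly 28K per 2MB hugetlb page, or about 1.4GB
for every 100GB of hugetlb pool.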

It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x!  Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    2m29.924s

In the VM environment, enabling hugetlb vmemmap optimization resulted in
allocation times being 61x slower.

A quick profile showed that the vast majority of this overhead was due to
TLB flushing.  Each time we modify the kernel pagetable we need to flush
the TLB.  For each hugetlb page that is optimized, up to two TLB flushes
may be performed: one for the vmemmap pages associated with the hugetlb
page, and potentially another if the vmemmap pages are mapped at the PMD
level and must be split.  The TLB flushes required for the kernel
pagetable result in a broadcast IPI, with each CPU having to flush a range
of pages or do a global flush if a threshold is exceeded.  So, the flush
time increases with the number of CPUs.  In addition, in virtual
environments the broadcast IPI can't be accelerated by hypervisor hardware
and leads to traps that need to wake up/IPI all vCPUs, which is very
expensive.  Because of this, the slowdown in virtual environments becomes
even worse than on bare metal as the number of vCPUs/CPUs increases.
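
To make the per-page cost concrete, here is a heavily condensed sketch of
the existing per-page path (based on the hugetlb_vmemmap.c code visible in
the hunks of patches 8-10 below; error handling, the vmemmap-optimized
flag, and the static branch accounting are dropped):

/* Condensed sketch of the pre-series per-page path. */
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
	unsigned long vmemmap_reuse;

	if (!vmemmap_should_optimize(h, head))
		return;

	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
	vmemmap_reuse	= vmemmap_start;
	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;

	/*
	 * vmemmap_remap_free() may first split a PMD mapped vmemmap area,
	 * which ends with flush_tlb_kernel_range(addr, addr + PMD_SIZE),
	 * and then remaps the PTEs, which ends with another
	 * flush_tlb_kernel_range() over the whole range.  Each flush is a
	 * broadcast IPI to every CPU, so up to two IPIs per hugetlb page.
	 */
	(void)vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse);
}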

The following series attempts to reduce the amount of time spent in TLB flushing.
The idea is to batch the vmemmap modification operations for multiple hugetlb
pages.  Instead of doing one or two TLB flushes for each page, we do two TLB
flushes for each batch of pages.  One flush after splitting pages mapped at
the PMD level, and another after remapping vmemmap associated with all
hugetlb pages.  Results of such batching are as follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m5.273s

With batching, results are back in the 2-3x slowdown range.
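
For reference, the batched path that produces these numbers ends up
looking like this after patches 9 and 10 (condensed from the diffs below;
two global flushes per batch instead of up to two flushes per page):

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	/* Pass 1: split any PMD mapped vmemmap, deferring the TLB flush. */
	list_for_each_entry(folio, folio_list, lru)
		hugetlb_vmemmap_split(h, &folio->page);

	flush_tlb_all();	/* one flush for all of the splits */

	/* Pass 2: remap the vmemmap of every folio, again without flushing. */
	list_for_each_entry(folio, folio_list, lru)
		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages,
					   VMEMMAP_NO_TLB_FLUSH);

	flush_tlb_all();	/* one flush for all of the remaps */

	free_vmemmap_page_list(&vmemmap_pages);
}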

This series is based on next-20230905.
The first 4 patches of the series are modifications currently going into
the mm tree that modify the same area.  They are not directly related to
the batching changes.
Patch 5 (hugetlb: restructure pool allocations) is where batching changes
begin.
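
To show where the batching hooks into the allocation side, the pool helper
introduced in patch 5 and extended in patch 6 ends up as follows (condensed
from the diffs below).  Newly allocated folios are collected on a list,
the whole list is optimized in one call, and the folios are then added to
the free lists under a single lock cycle:

static void prep_and_add_allocated_folios(struct hstate *h,
					struct list_head *folio_list)
{
	struct folio *folio, *tmp_f;

	/* Send the whole list for bulk vmemmap optimization processing. */
	hugetlb_vmemmap_optimize_folios(h, folio_list);

	/* Add all new pool pages to the free lists in one lock cycle. */
	spin_lock_irq(&hugetlb_lock);
	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
		__prep_account_new_huge_page(h, folio_nid(folio));
		enqueue_hugetlb_folio(h, folio);
	}
	spin_unlock_irq(&hugetlb_lock);

	INIT_LIST_HEAD(folio_list);
}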

Changes v1 -> v2:
- patch 5 now takes into account the requirement that only compound
  pages with the hugetlb flag set can be passed to vmemmap routines.  This
  involved separating the 'prep' of hugetlb pages even further.  The code
  dealing with bootmem allocations was also modified so that batching is
  possible.  Adding a 'batch' of hugetlb pages to their respective free
  lists is now done in one lock cycle.
- patch 7 added description of routine hugetlb_vmemmap_restore_folios
  (Muchun).
- patch 8 rename bulk_pages to vmemmap_pages and let caller be responsible
  for freeing (Muchun)
- patch 9 use 'walk->remap_pte' to determine if a split only operation
  is being performed (Muchun).  Removed unused variable and
  hugetlb_optimize_vmemmap_key (Muchun).
- Patch 10 pass 'flags variable' instead of bool to indicate behavior and
  allow for future expansion (Muchun).  Single flag VMEMMAP_NO_TLB_FLUSH.
  Provide detailed comment about the need to keep old and new vmemmap pages
  in sync (Muchun).
- Patch 11 pass flag variable as in patch 10 (Muchun).

Joao Martins (2):
  hugetlb: batch PMD split for bulk vmemmap dedup
  hugetlb: batch TLB flushes when freeing vmemmap

Matthew Wilcox (Oracle) (3):
  hugetlb: Use a folio in free_hpage_workfn()
  hugetlb: Remove a few calls to page_folio()
  hugetlb: Convert remove_pool_huge_page() to
    remove_pool_hugetlb_folio()

Mike Kravetz (6):
  hugetlb: set hugetlb page flag before optimizing vmemmap
  hugetlb: restructure pool allocations
  hugetlb: perform vmemmap optimization on a list of pages
  hugetlb: perform vmemmap restoration on a list of pages
  hugetlb: batch freeing of vmemmap pages
  hugetlb: batch TLB flushes when restoring vmemmap

 mm/hugetlb.c         | 283 +++++++++++++++++++++++++++++++------------
 mm/hugetlb_vmemmap.c | 210 ++++++++++++++++++++++++++------
 mm/hugetlb_vmemmap.h |  10 ++
 3 files changed, 387 insertions(+), 116 deletions(-)

-- 
2.41.0


* [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-06  0:48   ` Matthew Wilcox
  2023-10-13 12:58   ` Naoya Horiguchi
  2023-09-05 21:44 ` [PATCH v2 02/11] hugetlb: Use a folio in free_hpage_workfn() Mike Kravetz
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Currently, vmemmap optimization of hugetlb pages is performed before the
hugetlb flag (previously hugetlb destructor) is set identifying it as a
hugetlb folio.  This means there is a window of time where an ordinary
folio does not have all associated vmemmap present.  The core mm only
expects vmemmap to be potentially optimized for hugetlb and device dax.
This can cause problems in code such as memory error handling that may
want to write to tail struct pages.

There is only one call to perform hugetlb vmemmap optimization today.
To fix this issue, simply set the hugetlb flag before that call.

There was a similar issue in the free hugetlb path that was previously
addressed.  The two routines that optimize or restore hugetlb vmemmap
should only be passed hugetlb folios/pages.  To catch any callers not
following this rule, add VM_WARN_ON calls to the routines.  In the
hugetlb free code paths, some calls could be made to restore vmemmap
after clearing the hugetlb flag.  This was 'safe' as in these cases
vmemmap was already present and the call was a NOOP.  However, for
consistency these calls were eliminated so that we can add the
VM_WARN_ON checks.

Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c         | 31 ++++++++++++++++++++++---------
 mm/hugetlb_vmemmap.c |  3 +++
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba6d39b71cb1..c32ca241df4b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1720,7 +1720,12 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
 		return;
 
-	if (hugetlb_vmemmap_restore(h, &folio->page)) {
+	/*
+	 * If folio is not vmemmap optimized (!clear_dtor), then the folio
+	 * is no longer identified as a hugetlb page.  hugetlb_vmemmap_restore
+	 * can only be passed hugetlb pages and will BUG otherwise.
+	 */
+	if (clear_dtor && hugetlb_vmemmap_restore(h, &folio->page)) {
 		spin_lock_irq(&hugetlb_lock);
 		/*
 		 * If we cannot allocate vmemmap pages, just refuse to free the
@@ -1930,9 +1935,9 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
 
 static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
+	folio_set_hugetlb(folio);
 	hugetlb_vmemmap_optimize(h, &folio->page);
 	INIT_LIST_HEAD(&folio->lru);
-	folio_set_hugetlb(folio);
 	hugetlb_set_folio_subpool(folio, NULL);
 	set_hugetlb_cgroup(folio, NULL);
 	set_hugetlb_cgroup_rsvd(folio, NULL);
@@ -3580,13 +3585,21 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 	remove_hugetlb_folio_for_demote(h, folio, false);
 	spin_unlock_irq(&hugetlb_lock);
 
-	rc = hugetlb_vmemmap_restore(h, &folio->page);
-	if (rc) {
-		/* Allocation of vmemmmap failed, we can not demote folio */
-		spin_lock_irq(&hugetlb_lock);
-		folio_ref_unfreeze(folio, 1);
-		add_hugetlb_folio(h, folio, false);
-		return rc;
+	/*
+	 * If vmemmap already existed for folio, the remove routine above would
+	 * have cleared the hugetlb folio flag.  Hence the folio is technically
+	 * no longer a hugetlb folio.  hugetlb_vmemmap_restore can only be
+	 * passed hugetlb folios and will BUG otherwise.
+	 */
+	if (folio_test_hugetlb(folio)) {
+		rc = hugetlb_vmemmap_restore(h, &folio->page);
+		if (rc) {
+			/* Allocation of vmemmmap failed, we can not demote folio */
+			spin_lock_irq(&hugetlb_lock);
+			folio_ref_unfreeze(folio, 1);
+			add_hugetlb_folio(h, folio, false);
+			return rc;
+		}
 	}
 
 	/*
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4b9734777f69..aeb7dd889eee 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -13,6 +13,7 @@
 #include <linux/pgtable.h>
 #include <linux/moduleparam.h>
 #include <linux/bootmem_info.h>
+#include <linux/mmdebug.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
@@ -456,6 +457,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
 	unsigned long vmemmap_reuse;
 
+	VM_WARN_ON_ONCE(!PageHuge(head));
 	if (!HPageVmemmapOptimized(head))
 		return 0;
 
@@ -550,6 +552,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
 	unsigned long vmemmap_reuse;
 
+	VM_WARN_ON_ONCE(!PageHuge(head));
 	if (!vmemmap_should_optimize(h, head))
 		return;
 
-- 
2.41.0


* [PATCH v2 02/11] hugetlb: Use a folio in free_hpage_workfn()
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 03/11] hugetlb: Remove a few calls to page_folio() Mike Kravetz
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz, Sidhartha Kumar

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

update_and_free_hugetlb_folio puts the memory on hpage_freelist as a folio
so we can take it off the list as a folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
 mm/hugetlb.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c32ca241df4b..a27fcff3350f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1787,22 +1787,22 @@ static void free_hpage_workfn(struct work_struct *work)
 	node = llist_del_all(&hpage_freelist);
 
 	while (node) {
-		struct page *page;
+		struct folio *folio;
 		struct hstate *h;
 
-		page = container_of((struct address_space **)node,
-				     struct page, mapping);
+		folio = container_of((struct address_space **)node,
+				     struct folio, mapping);
 		node = node->next;
-		page->mapping = NULL;
+		folio->mapping = NULL;
 		/*
 		 * The VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio) in
 		 * folio_hstate() is going to trigger because a previous call to
 		 * remove_hugetlb_folio() will clear the hugetlb bit, so do
 		 * not use folio_hstate() directly.
 		 */
-		h = size_to_hstate(page_size(page));
+		h = size_to_hstate(folio_size(folio));
 
-		__update_and_free_hugetlb_folio(h, page_folio(page));
+		__update_and_free_hugetlb_folio(h, folio);
 
 		cond_resched();
 	}
-- 
2.41.0


* [PATCH v2 03/11] hugetlb: Remove a few calls to page_folio()
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 02/11] hugetlb: Use a folio in free_hpage_workfn() Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 04/11] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio() Mike Kravetz
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz, Sidhartha Kumar

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Anything found on a linked list threaded through ->lru is guaranteed to
be a folio as the compound_head found in a tail page overlaps the ->lru
member of struct page.  So we can pull folios directly off these lists
no matter whether pages or folios were added to the list.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
 mm/hugetlb.c | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a27fcff3350f..f768fe9aebad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1836,11 +1836,9 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
 
 static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 {
-	struct page *page, *t_page;
-	struct folio *folio;
+	struct folio *folio, *t_folio;
 
-	list_for_each_entry_safe(page, t_page, list, lru) {
-		folio = page_folio(page);
+	list_for_each_entry_safe(folio, t_folio, list, lru) {
 		update_and_free_hugetlb_folio(h, folio, false);
 		cond_resched();
 	}
@@ -2229,8 +2227,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 						 bool acct_surplus)
 {
 	int nr_nodes, node;
-	struct page *page = NULL;
-	struct folio *folio;
+	struct folio *folio = NULL;
 
 	lockdep_assert_held(&hugetlb_lock);
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
@@ -2240,15 +2237,14 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 		 */
 		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
 		    !list_empty(&h->hugepage_freelists[node])) {
-			page = list_entry(h->hugepage_freelists[node].next,
-					  struct page, lru);
-			folio = page_folio(page);
+			folio = list_entry(h->hugepage_freelists[node].next,
+					  struct folio, lru);
 			remove_hugetlb_folio(h, folio, acct_surplus);
 			break;
 		}
 	}
 
-	return page;
+	return &folio->page;
 }
 
 /*
@@ -3364,15 +3360,15 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	 * Collect pages to be freed on a list, and free after dropping lock
 	 */
 	for_each_node_mask(i, *nodes_allowed) {
-		struct page *page, *next;
+		struct folio *folio, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
-		list_for_each_entry_safe(page, next, freel, lru) {
+		list_for_each_entry_safe(folio, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
 				goto out;
-			if (PageHighMem(page))
+			if (folio_test_highmem(folio))
 				continue;
-			remove_hugetlb_folio(h, page_folio(page), false);
-			list_add(&page->lru, &page_list);
+			remove_hugetlb_folio(h, folio, false);
+			list_add(&folio->lru, &page_list);
 		}
 	}
 
-- 
2.41.0


* [PATCH v2 04/11] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio()
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (2 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 03/11] hugetlb: Remove a few calls to page_folio() Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 05/11] hugetlb: restructure pool allocations Mike Kravetz
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz, Sidhartha Kumar

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Convert the callers to expect a folio and remove the unnecessary conversion
back to a struct page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
 mm/hugetlb.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f768fe9aebad..278c8ae6a36c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1446,7 +1446,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for remove_pool_huge_page() - return the previously saved
+ * helper for remove_pool_hugetlb_folio() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -2222,9 +2222,8 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
  * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static struct page *remove_pool_huge_page(struct hstate *h,
-						nodemask_t *nodes_allowed,
-						 bool acct_surplus)
+static struct folio *remove_pool_hugetlb_folio(struct hstate *h,
+		nodemask_t *nodes_allowed, bool acct_surplus)
 {
 	int nr_nodes, node;
 	struct folio *folio = NULL;
@@ -2244,7 +2243,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 		}
 	}
 
-	return &folio->page;
+	return folio;
 }
 
 /*
@@ -2598,7 +2597,6 @@ static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
-	struct page *page;
 	LIST_HEAD(page_list);
 
 	lockdep_assert_held(&hugetlb_lock);
@@ -2619,15 +2617,17 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * evenly across all nodes with memory. Iterate across these nodes
 	 * until we can no longer free unreserved surplus pages. This occurs
 	 * when the nodes with surplus pages have no free pages.
-	 * remove_pool_huge_page() will balance the freed pages across the
+	 * remove_pool_hugetlb_folio() will balance the freed pages across the
 	 * on-line nodes with memory and will handle the hstate accounting.
 	 */
 	while (nr_pages--) {
-		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
-		if (!page)
+		struct folio *folio;
+
+		folio = remove_pool_hugetlb_folio(h, &node_states[N_MEMORY], 1);
+		if (!folio)
 			goto out;
 
-		list_add(&page->lru, &page_list);
+		list_add(&folio->lru, &page_list);
 	}
 
 out:
@@ -3422,7 +3422,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
-	struct page *page;
 	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
@@ -3542,11 +3541,13 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * Collect pages to be removed on list without dropping lock
 	 */
 	while (min_count < persistent_huge_pages(h)) {
-		page = remove_pool_huge_page(h, nodes_allowed, 0);
-		if (!page)
+		struct folio *folio;
+
+		folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
+		if (!folio)
 			break;
 
-		list_add(&page->lru, &page_list);
+		list_add(&folio->lru, &page_list);
 	}
 	/* free the pages after dropping lock */
 	spin_unlock_irq(&hugetlb_lock);
-- 
2.41.0


* [PATCH v2 05/11] hugetlb: restructure pool allocations
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (3 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 04/11] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio() Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 06/11] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Allocation of a hugetlb page for the hugetlb pool is done by the routine
alloc_pool_huge_page.  This routine will allocate contiguous pages from
a low level allocator, prep the pages for usage as a hugetlb page and
then add the resulting hugetlb page to the pool.

In the 'prep' stage, optional vmemmap optimization is done.  For
performance reasons we want to perform vmemmap optimization on multiple
hugetlb pages at once.  To do this, restructure the hugetlb pool
allocation code such that vmemmap optimization can be isolated and later
batched.

The code to allocate hugetlb pages from bootmem was also modified to
allow batching.

No functional changes, only code restructure.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 183 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 144 insertions(+), 39 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 278c8ae6a36c..d1950b726346 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1931,16 +1931,21 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
 	h->nr_huge_pages_node[nid]++;
 }
 
-static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	folio_set_hugetlb(folio);
-	hugetlb_vmemmap_optimize(h, &folio->page);
 	INIT_LIST_HEAD(&folio->lru);
 	hugetlb_set_folio_subpool(folio, NULL);
 	set_hugetlb_cgroup(folio, NULL);
 	set_hugetlb_cgroup_rsvd(folio, NULL);
 }
 
+static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+{
+	init_new_hugetlb_folio(h, folio);
+	hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
 static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int nid)
 {
 	__prep_new_hugetlb_folio(h, folio);
@@ -2151,16 +2156,9 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	return page_folio(page);
 }
 
-/*
- * Common helper to allocate a fresh hugetlb page. All specific allocators
- * should use this function to get new hugetlb pages
- *
- * Note that returned page is 'frozen':  ref count of head page and all tail
- * pages is zero.
- */
-static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask,
-		nodemask_t *node_alloc_noretry)
+static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
+				gfp_t gfp_mask, int nid, nodemask_t *nmask,
+				nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
 	bool retry = false;
@@ -2173,6 +2171,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 				nid, nmask, node_alloc_noretry);
 	if (!folio)
 		return NULL;
+
 	if (hstate_is_gigantic(h)) {
 		if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
 			/*
@@ -2187,32 +2186,84 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 			return NULL;
 		}
 	}
-	prep_new_hugetlb_folio(h, folio, folio_nid(folio));
 
 	return folio;
 }
 
+static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
+{
+	struct folio *folio;
+
+	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+						node_alloc_noretry);
+	if (folio)
+		init_new_hugetlb_folio(h, folio);
+	return folio;
+}
+
 /*
- * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
- * manner.
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen':  ref count of head page and all tail
+ * pages is zero.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-				nodemask_t *node_alloc_noretry)
+static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
-	int nr_nodes, node;
+
+	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+						node_alloc_noretry);
+	if (!folio)
+		return NULL;
+
+	prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+	return folio;
+}
+
+static void prep_and_add_allocated_folios(struct hstate *h,
+					struct list_head *folio_list)
+{
+	struct folio *folio, *tmp_f;
+
+	/*
+	 * Add all new pool pages to free lists in one lock cycle
+	 */
+	spin_lock_irq(&hugetlb_lock);
+	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+		__prep_account_new_huge_page(h, folio_nid(folio));
+		enqueue_hugetlb_folio(h, folio);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+
+	INIT_LIST_HEAD(folio_list);
+}
+
+/*
+ * Allocates a fresh hugetlb page in a node interleaved manner.  The page
+ * will later be added to the appropriate hugetlb pool.
+ */
+static struct folio *alloc_pool_huge_folio(struct hstate *h,
+					nodemask_t *nodes_allowed,
+					nodemask_t *node_alloc_noretry)
+{
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nr_nodes, node;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
+		struct folio *folio;
+
+		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
 					nodes_allowed, node_alloc_noretry);
-		if (folio) {
-			free_huge_folio(folio); /* free it into the hugepage allocator */
-			return 1;
-		}
+		if (folio)
+			return folio;
 	}
 
-	return 0;
+	return NULL;
 }
 
 /*
@@ -3178,19 +3229,29 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
  */
 static void __init gather_bootmem_prealloc(void)
 {
+	LIST_HEAD(folio_list);
 	struct huge_bootmem_page *m;
+	struct hstate *h, *prev_h = NULL;
 
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = page_folio(page);
-		struct hstate *h = m->hstate;
+
+		h = m->hstate;
+		/*
+		 * It is possible to gave multiple huge page sizes (hstates)
+		 * in this list.  If so, process each size separately.
+		 */
+		if (h != prev_h && prev_h != NULL)
+			prep_and_add_allocated_folios(prev_h, &folio_list);
+		prev_h = h;
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(folio_ref_count(folio) != 1);
 		if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
 			WARN_ON(folio_test_reserved(folio));
-			prep_new_hugetlb_folio(h, folio, folio_nid(folio));
-			free_huge_folio(folio); /* add to the hugepage allocator */
+			init_new_hugetlb_folio(h, folio);
+			list_add(&folio->lru, &folio_list);
 		} else {
 			/* VERY unlikely inflated ref count on a tail page */
 			free_gigantic_folio(folio, huge_page_order(h));
@@ -3204,6 +3265,8 @@ static void __init gather_bootmem_prealloc(void)
 		adjust_managed_page_count(page, pages_per_huge_page(h));
 		cond_resched();
 	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
 }
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 {
@@ -3236,9 +3299,22 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 	h->max_huge_pages_node[nid] = i;
 }
 
+/*
+ * NOTE: this routine is called in different contexts for gigantic and
+ * non-gigantic pages.
+ * - For gigantic pages, this is called early in the boot process and
+ *   pages are allocated from memblock allocated or something similar.
+ *   Gigantic pages are actually added to pools later with the routine
+ *   gather_bootmem_prealloc.
+ * - For non-gigantic pages, this is called later in the boot process after
+ *   all of mm is up and functional.  Pages are allocated from buddy and
+ *   then added to hugetlb pools.
+ */
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	struct folio *folio;
+	LIST_HEAD(folio_list);
 	nodemask_t *node_alloc_noretry;
 	bool node_specific_alloc = false;
 
@@ -3280,14 +3356,25 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
+			/*
+			 * gigantic pages not added to list as they are not
+			 * added to pools now.
+			 */
 			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
 				break;
-		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY],
-					 node_alloc_noretry))
-			break;
+		} else {
+			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+							node_alloc_noretry);
+			if (!folio)
+				break;
+			list_add(&folio->lru, &folio_list);
+		}
 		cond_resched();
 	}
+
+	/* list will be empty if hstate_is_gigantic */
+	prep_and_add_allocated_folios(h, &folio_list);
+
 	if (i < h->max_huge_pages) {
 		char buf[32];
 
@@ -3421,7 +3508,9 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
-	unsigned long min_count, ret;
+	unsigned long min_count;
+	unsigned long allocated;
+	struct folio *folio;
 	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
@@ -3496,7 +3585,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			break;
 	}
 
-	while (count > persistent_huge_pages(h)) {
+	allocated = 0;
+	while (count > (persistent_huge_pages(h) + allocated)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_folio will handle it by freeing the page
@@ -3507,15 +3597,32 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed,
+		folio = alloc_pool_huge_folio(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock_irq(&hugetlb_lock);
-		if (!ret)
+		if (!folio) {
+			prep_and_add_allocated_folios(h, &page_list);
+			spin_lock_irq(&hugetlb_lock);
 			goto out;
+		}
+
+		list_add(&folio->lru, &page_list);
+		allocated++;
 
 		/* Bail for signals. Probably ctrl-c from user */
-		if (signal_pending(current))
+		if (signal_pending(current)) {
+			prep_and_add_allocated_folios(h, &page_list);
+			spin_lock_irq(&hugetlb_lock);
 			goto out;
+		}
+
+		spin_lock_irq(&hugetlb_lock);
+	}
+
+	/* Add allocated pages to the pool */
+	if (!list_empty(&page_list)) {
+		spin_unlock_irq(&hugetlb_lock);
+		prep_and_add_allocated_folios(h, &page_list);
+		spin_lock_irq(&hugetlb_lock);
 	}
 
 	/*
@@ -3541,8 +3648,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * Collect pages to be removed on list without dropping lock
 	 */
 	while (min_count < persistent_huge_pages(h)) {
-		struct folio *folio;
-
 		folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
 		if (!folio)
 			break;
-- 
2.41.0


* [PATCH v2 06/11] hugetlb: perform vmemmap optimization on a list of pages
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (4 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 05/11] hugetlb: restructure pool allocations Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-06  7:30   ` Muchun Song
  2023-09-05 21:44 ` [PATCH v2 07/11] hugetlb: perform vmemmap restoration " Mike Kravetz
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

When adding hugetlb pages to the pool, we first create a list of the
allocated pages before adding them to the pool.  Pass this list of pages to a
new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.

We also modify the routine vmemmap_should_optimize() to check for pages
that are already optimized.  There are code paths that might request
vmemmap optimization twice and we want to make sure this is not
attempted.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c         |  5 +++++
 mm/hugetlb_vmemmap.c | 11 +++++++++++
 mm/hugetlb_vmemmap.h |  5 +++++
 3 files changed, 21 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d1950b726346..554be94b07bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2230,6 +2230,11 @@ static void prep_and_add_allocated_folios(struct hstate *h,
 {
 	struct folio *folio, *tmp_f;
 
+	/*
+	 * Send list for bulk vmemmap optimization processing
+	 */
+	hugetlb_vmemmap_optimize_folios(h, folio_list);
+
 	/*
 	 * Add all new pool pages to free lists in one lock cycle
 	 */
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index aeb7dd889eee..ac5577d372fe 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -484,6 +484,9 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
 static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
 {
+	if (HPageVmemmapOptimized((struct page *)head))
+		return false;
+
 	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return false;
 
@@ -573,6 +576,14 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 		SetHPageVmemmapOptimized(head);
 }
 
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+	struct folio *folio;
+
+	list_for_each_entry(folio, folio_list, lru)
+		hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
 		.procname	= "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 25bd0e002431..036494e040ca 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -13,6 +13,7 @@
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
 void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 
 /*
  * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
@@ -47,6 +48,10 @@ static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page
 {
 }
 
+static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+}
+
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
 	return 0;
-- 
2.41.0


* [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (5 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 06/11] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-06  7:33   ` Muchun Song
  2023-09-05 21:44 ` [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages Mike Kravetz
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

When removing hugetlb pages from the pool, we first create a list
of removed pages and then free those pages back to low level allocators.
Part of the 'freeing process' is to restore vmemmap for all base pages
if necessary.  Pass this list of pages to a new routine
hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
performed in bulk.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c         |  3 +++
 mm/hugetlb_vmemmap.c | 13 +++++++++++++
 mm/hugetlb_vmemmap.h |  5 +++++
 3 files changed, 21 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 554be94b07bd..dd2dbc256172 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 {
 	struct folio *folio, *t_folio;
 
+	/* First restore vmemmap for all pages on list. */
+	hugetlb_vmemmap_restore_folios(h, list);
+
 	list_for_each_entry_safe(folio, t_folio, list, lru) {
 		update_and_free_hugetlb_folio(h, folio, false);
 		cond_resched();
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ac5577d372fe..79de984919ef 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	return ret;
 }
 
+/*
+ * This function will attempt to resore vmemmap for a list of folios.  There
+ * is no guarantee that restoration will be successful for all or any folios.
+ * This is used in bulk operations, and no feedback is given to the caller.
+ */
+void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
+{
+	struct folio *folio;
+
+	list_for_each_entry(folio, folio_list, lru)
+		(void)hugetlb_vmemmap_restore(h, &folio->page);
+}
+
 /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
 static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
 {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 036494e040ca..b4ee945dc1d4 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,6 +12,7 @@
 
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list);
 void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 
@@ -44,6 +45,10 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
 	return 0;
 }
 
+static inline void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
+{
+}
+
 static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 {
 }
-- 
2.41.0


* [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (6 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 07/11] hugetlb: perform vmemmap restoration " Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-06  7:38   ` Muchun Song
  2023-09-05 21:44 ` [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Now that batching of hugetlb vmemmap optimization processing is possible,
batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
hugetlb page, we add them to a list that is freed after the entire batch
has been processed.

This enhances the ability to return contiguous ranges of memory to the
low level allocators.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 60 ++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 22 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 79de984919ef..a715712df831 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -306,18 +306,21 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
  * @reuse:	reuse address.
+ * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
+ *		responsibility to free pages.
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
-			      unsigned long reuse)
+			      unsigned long reuse,
+			      struct list_head *vmemmap_pages)
 {
 	int ret;
-	LIST_HEAD(vmemmap_pages);
+	LIST_HEAD(freed_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_remap_pte,
 		.reuse_addr	= reuse,
-		.vmemmap_pages	= &vmemmap_pages,
+		.vmemmap_pages	= &freed_pages,
 	};
 	int nid = page_to_nid((struct page *)start);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
@@ -335,7 +338,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	if (walk.reuse_page) {
 		copy_page(page_to_virt(walk.reuse_page),
 			  (void *)walk.reuse_addr);
-		list_add(&walk.reuse_page->lru, &vmemmap_pages);
+		list_add(&walk.reuse_page->lru, &freed_pages);
 	}
 
 	/*
@@ -366,15 +369,14 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		walk = (struct vmemmap_remap_walk) {
 			.remap_pte	= vmemmap_restore_pte,
 			.reuse_addr	= reuse,
-			.vmemmap_pages	= &vmemmap_pages,
+			.vmemmap_pages	= &freed_pages,
 		};
 
 		vmemmap_remap_range(reuse, end, &walk);
 	}
 	mmap_read_unlock(&init_mm);
 
-	free_vmemmap_page_list(&vmemmap_pages);
-
+	list_splice(&freed_pages, vmemmap_pages);
 	return ret;
 }
 
@@ -553,17 +555,9 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
 	return true;
 }
 
-/**
- * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
- * @h:		struct hstate.
- * @head:	the head page whose vmemmap pages will be optimized.
- *
- * This function only tries to optimize @head's vmemmap pages and does not
- * guarantee that the optimization will succeed after it returns. The caller
- * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
- * have been optimized.
- */
-void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+static void __hugetlb_vmemmap_optimize(const struct hstate *h,
+					struct page *head,
+					struct list_head *vmemmap_pages)
 {
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
 	unsigned long vmemmap_reuse;
@@ -580,21 +574,43 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 
 	/*
 	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
-	 * to the page which @vmemmap_reuse is mapped to, then free the pages
-	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
+	 * to the page which @vmemmap_reuse is mapped to.  Add pages previously
+	 * mapping the range to vmemmap_pages list so that they can be freed by
+	 * the caller.
 	 */
-	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
+	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages))
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
 	else
 		SetHPageVmemmapOptimized(head);
 }
 
+/**
+ * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
+ * @h:		struct hstate.
+ * @head:	the head page whose vmemmap pages will be optimized.
+ *
+ * This function only tries to optimize @head's vmemmap pages and does not
+ * guarantee that the optimization will succeed after it returns. The caller
+ * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
+ * have been optimized.
+ */
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+{
+	LIST_HEAD(vmemmap_pages);
+
+	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
+	free_vmemmap_page_list(&vmemmap_pages);
+}
+
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 {
 	struct folio *folio;
+	LIST_HEAD(vmemmap_pages);
 
 	list_for_each_entry(folio, folio_list, lru)
-		hugetlb_vmemmap_optimize(h, &folio->page);
+		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
+
+	free_vmemmap_page_list(&vmemmap_pages);
 }
 
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
-- 
2.41.0


* [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (7 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-06  8:24   ` Muchun Song
  2023-09-05 21:44 ` [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
  2023-09-05 21:44 ` [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
  10 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

From: Joao Martins <joao.m.martins@oracle.com>

In an effort to minimize the number of TLB flushes, batch all PMD splits
belonging to a range of pages in order to perform only 1 (global) TLB
flush.

Rebased and updated by Mike Kravetz

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 68 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a715712df831..d956551699bc 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
 	struct list_head	*vmemmap_pages;
 };
 
-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
 {
 	pmd_t __pmd;
 	int i;
@@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
 		/* Make pte visible before pmd. See comment in pmd_install(). */
 		smp_wmb();
 		pmd_populate_kernel(&init_mm, pmd, pgtable);
-		flush_tlb_kernel_range(start, start + PMD_SIZE);
+		if (flush)
+			flush_tlb_kernel_range(start, start + PMD_SIZE);
 	} else {
 		pte_free_kernel(&init_mm, pgtable);
 	}
@@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
 	do {
 		int ret;
 
-		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
+		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
+				walk->remap_pte != NULL);
 		if (ret)
 			return ret;
 
 		next = pmd_addr_end(addr, end);
+
+		/*
+		 * We are only splitting, not remapping the hugetlb vmemmap
+		 * pages.
+		 */
+		if (!walk->remap_pte)
+			continue;
+
 		vmemmap_pte_range(pmd, addr, next, walk);
 	} while (pmd++, addr = next, addr != end);
 
@@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 			return ret;
 	} while (pgd++, addr = next, addr != end);
 
-	flush_tlb_kernel_range(start, end);
+	if (walk->remap_pte)
+		flush_tlb_kernel_range(start, end);
 
 	return 0;
 }
@@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
 }
 
+/**
+ * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
+ *                      backing PMDs of the directmap into PTEs
+ * @start:     start address of the vmemmap virtual address range that we want
+ *             to remap.
+ * @end:       end address of the vmemmap virtual address range that we want to
+ *             remap.
+ * @reuse:     reuse address.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+static int vmemmap_remap_split(unsigned long start, unsigned long end,
+				unsigned long reuse)
+{
+	int ret;
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= NULL,
+	};
+
+	/* See the comment in the vmemmap_remap_free(). */
+	BUG_ON(start - reuse != PAGE_SIZE);
+
+	mmap_read_lock(&init_mm);
+	ret = vmemmap_remap_range(reuse, end, &walk);
+	mmap_read_unlock(&init_mm);
+
+	return ret;
+}
+
 /**
  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
  *			to the page which @reuse is mapped to, then free vmemmap
@@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
+static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
+{
+	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
+	unsigned long vmemmap_reuse;
+
+	if (!vmemmap_should_optimize(h, head))
+		return;
+
+	vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
+	vmemmap_reuse   = vmemmap_start;
+	vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
+
+	/*
+	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
+	 * @vmemmap_end]
+	 */
+	vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+}
+
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 {
 	struct folio *folio;
 	LIST_HEAD(vmemmap_pages);
 
+	list_for_each_entry(folio, folio_list, lru)
+		hugetlb_vmemmap_split(h, &folio->page);
+
+	flush_tlb_all();
+
 	list_for_each_entry(folio, folio_list, lru)
 		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
 
-- 
2.41.0


* [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (8 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-07  6:55   ` Muchun Song
  2023-09-05 21:44 ` [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
  10 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

From: Joao Martins <joao.m.martins@oracle.com>

Now that a list of pages is deduplicated at once, the TLB
flush can be batched for all vmemmap pages that got remapped.

Add a flags field to the remap walk and pass whether this is a bulk
operation or a single page, so the remap code can decide whether to
flush immediately or defer the flush to the caller.

The TLB flush is global because there is no guarantee from the caller
that the set of folios is contiguous, and composing a list of kVAs to
flush would add complexity.

Modified by Mike Kravetz to perform a TLB flush on a single folio if an
error is encountered.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d956551699bc..8c85e2c38538 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -27,6 +27,7 @@
  * @reuse_addr:		the virtual address of the @reuse_page page.
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
  *			or is mapped from.
+ * @flags:		used to modify behavior in bulk operations
  */
 struct vmemmap_remap_walk {
 	void			(*remap_pte)(pte_t *pte, unsigned long addr,
@@ -35,6 +36,8 @@ struct vmemmap_remap_walk {
 	struct page		*reuse_page;
 	unsigned long		reuse_addr;
 	struct list_head	*vmemmap_pages;
+#define VMEMMAP_NO_TLB_FLUSH		BIT(0)
+	unsigned long		flags;
 };
 
 static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
@@ -208,7 +211,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 			return ret;
 	} while (pgd++, addr = next, addr != end);
 
-	if (walk->remap_pte)
+	if (walk->remap_pte && !(walk->flags & VMEMMAP_NO_TLB_FLUSH))
 		flush_tlb_kernel_range(start, end);
 
 	return 0;
@@ -348,12 +351,14 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  * @reuse:	reuse address.
  * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
  *		responsibility to free pages.
+ * @flags:	modifications to vmemmap_remap_walk flags
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
 			      unsigned long reuse,
-			      struct list_head *vmemmap_pages)
+			      struct list_head *vmemmap_pages,
+			      unsigned long flags)
 {
 	int ret;
 	LIST_HEAD(freed_pages);
@@ -361,6 +366,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.remap_pte	= vmemmap_remap_pte,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &freed_pages,
+		.flags		= flags,
 	};
 	int nid = page_to_nid((struct page *)start);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
@@ -410,6 +416,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 			.remap_pte	= vmemmap_restore_pte,
 			.reuse_addr	= reuse,
 			.vmemmap_pages	= &freed_pages,
+			.flags		= 0,
 		};
 
 		vmemmap_remap_range(reuse, end, &walk);
@@ -597,7 +604,8 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
 
 static void __hugetlb_vmemmap_optimize(const struct hstate *h,
 					struct page *head,
-					struct list_head *vmemmap_pages)
+					struct list_head *vmemmap_pages,
+					unsigned long flags)
 {
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
 	unsigned long vmemmap_reuse;
@@ -607,6 +615,18 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
 		return;
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
+	/*
+	 * Very Subtle
+	 * If VMEMMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
+	 * immediately after remapping.  As a result, subsequent accesses
+	 * and modifications to struct pages associated with the hugetlb
+	 * page could be to the OLD struct pages.  Set the vmemmap optimized
+	 * flag here so that it is copied to the new head page.  This keeps
+	 * the old and new struct pages in sync.
+	 * If there is an error during optimization, we will immediately FLUSH
+	 * the TLB and clear the flag below.
+	 */
+	SetHPageVmemmapOptimized(head);
 
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 	vmemmap_reuse	= vmemmap_start;
@@ -618,10 +638,10 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
 	 * mapping the range to vmemmap_pages list so that they can be freed by
 	 * the caller.
 	 */
-	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages))
+	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages, flags)) {
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
-	else
-		SetHPageVmemmapOptimized(head);
+		ClearHPageVmemmapOptimized(head);
+	}
 }
 
 /**
@@ -638,7 +658,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 {
 	LIST_HEAD(vmemmap_pages);
 
-	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
+	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages, 0UL);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
@@ -672,7 +692,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 	flush_tlb_all();
 
 	list_for_each_entry(folio, folio_list, lru)
-		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
+		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages, VMEMMAP_NO_TLB_FLUSH);
+
+	flush_tlb_all();
 
 	free_vmemmap_page_list(&vmemmap_pages);
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap
  2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
                   ` (9 preceding siblings ...)
  2023-09-05 21:44 ` [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
@ 2023-09-05 21:44 ` Mike Kravetz
  2023-09-07  6:58   ` Muchun Song
  10 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-05 21:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton, Mike Kravetz

Update the hugetlb_vmemmap_restore path to take a flags parameter that can
indicate restoration is happening on a batch of pages.  When the flag
VMEMMAP_NO_TLB_FLUSH is set, use the existing mechanism to delay TLB flushing.
The routine hugetlb_vmemmap_restore_folios is the only caller passing this
flag, and it performs a global flush after all vmemmap is restored.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8c85e2c38538..11fda9d061eb 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -458,17 +458,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
  * @reuse:	reuse address.
+ * @flags:	modify behavior for bulk operations
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			       unsigned long reuse)
+			       unsigned long reuse, unsigned long flags)
 {
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_restore_pte,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
+		.flags		= flags,
 	};
 
 	/* See the comment in the vmemmap_remap_free(). */
@@ -490,17 +492,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
 static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
 core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
 
-/**
- * hugetlb_vmemmap_restore - restore previously optimized (by
- *			     hugetlb_vmemmap_optimize()) vmemmap pages which
- *			     will be reallocated and remapped.
- * @h:		struct hstate.
- * @head:	the head page whose vmemmap pages will be restored.
- *
- * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
- * negative error code otherwise.
- */
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
 {
 	int ret;
 	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -521,7 +513,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
 	if (!ret) {
 		ClearHPageVmemmapOptimized(head);
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -530,6 +522,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	return ret;
 }
 
+/**
+ * hugetlb_vmemmap_restore - restore previously optimized (by
+ *			     hugetlb_vmemmap_optimize()) vmemmap pages which
+ *			     will be reallocated and remapped.
+ * @h:		struct hstate.
+ * @head:	the head page whose vmemmap pages will be restored.
+ *
+ * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
+ * negative error code otherwise.
+ */
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+{
+	return __hugetlb_vmemmap_restore(h, head, 0UL);
+}
+
 /*
  * This function will attempt to restore vmemmap for a list of folios.  There
  * is no guarantee that restoration will be successful for all or any folios.
@@ -540,7 +547,9 @@ void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *fo
 	struct folio *folio;
 
 	list_for_each_entry(folio, folio_list, lru)
-		(void)hugetlb_vmemmap_restore(h, &folio->page);
+		(void)__hugetlb_vmemmap_restore(h, &folio->page, VMEMMAP_NO_TLB_FLUSH);
+
+	flush_tlb_all();
 }
 
 /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-09-05 21:44 ` [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap Mike Kravetz
@ 2023-09-06  0:48   ` Matthew Wilcox
  2023-09-06  1:05     ` Mike Kravetz
  2023-10-13 12:58   ` Naoya Horiguchi
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2023-09-06  0:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Naoya Horiguchi, Barry Song, Michal Hocko,
	Xiongchun Duan, Andrew Morton

On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> @@ -456,6 +457,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>  	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>  	unsigned long vmemmap_reuse;
>  
> +	VM_WARN_ON_ONCE(!PageHuge(head));
>  	if (!HPageVmemmapOptimized(head))
>  		return 0;
>  
> @@ -550,6 +552,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>  	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>  	unsigned long vmemmap_reuse;
>  
> +	VM_WARN_ON_ONCE(!PageHuge(head));
>  	if (!vmemmap_should_optimize(h, head))
>  		return;

Someone who's looking for an easy patch or three should convert both
of these functions to take a folio instead of a page.  All callers
pass &folio->page.  Obviously do that work on top of Mike's patch set
to avoid creating more work for him.
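
For illustration only, a first step could be thin folio wrappers along these
lines (the names here are made up, not part of this series), with the
internals converted afterwards:

	static inline int hugetlb_vmemmap_restore_folio(const struct hstate *h,
							struct folio *folio)
	{
		return hugetlb_vmemmap_restore(h, &folio->page);
	}

	static inline void hugetlb_vmemmap_optimize_folio(const struct hstate *h,
							  struct folio *folio)
	{
		hugetlb_vmemmap_optimize(h, &folio->page);
	}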

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-09-06  0:48   ` Matthew Wilcox
@ 2023-09-06  1:05     ` Mike Kravetz
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-06  1:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Naoya Horiguchi, Barry Song, Michal Hocko,
	Xiongchun Duan, Andrew Morton

On 09/06/23 01:48, Matthew Wilcox wrote:
> On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> > @@ -456,6 +457,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> >  	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> >  	unsigned long vmemmap_reuse;
> >  
> > +	VM_WARN_ON_ONCE(!PageHuge(head));
> >  	if (!HPageVmemmapOptimized(head))
> >  		return 0;
> >  
> > @@ -550,6 +552,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >  	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> >  	unsigned long vmemmap_reuse;
> >  
> > +	VM_WARN_ON_ONCE(!PageHuge(head));
> >  	if (!vmemmap_should_optimize(h, head))
> >  		return;
> 
> Someone who's looking for an easy patch or three should convert both
> of these functions to take a folio instead of a page.  All callers
> pass &folio->page.  Obviously do that work on top of Mike's patch set
> to avoid creating more work for him.

I think Muchun already suggested this.

It would make sense as this series is proposing two new routines taking
a list of folios:
- hugetlb_vmemmap_optimize_folios
- hugetlb_vmemmap_restore_folios

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 06/11] hugetlb: perform vmemmap optimization on a list of pages
  2023-09-05 21:44 ` [PATCH v2 06/11] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
@ 2023-09-06  7:30   ` Muchun Song
  0 siblings, 0 replies; 44+ messages in thread
From: Muchun Song @ 2023-09-06  7:30 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



On 2023/9/6 05:44, Mike Kravetz wrote:
> When adding hugetlb pages to the pool, we first create a list of the
> allocated pages before adding to the pool.  Pass this list of pages to a
> new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.
>
> We also modify the routine vmemmap_should_optimize() to check for pages
> that are already optimized.  There are code paths that might request
> vmemmap optimization twice and we want to make sure this is not
> attempted.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-05 21:44 ` [PATCH v2 07/11] hugetlb: perform vmemmap restoration " Mike Kravetz
@ 2023-09-06  7:33   ` Muchun Song
  2023-09-06  8:07     ` Muchun Song
  2023-09-06 20:53     ` Mike Kravetz
  0 siblings, 2 replies; 44+ messages in thread
From: Muchun Song @ 2023-09-06  7:33 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



On 2023/9/6 05:44, Mike Kravetz wrote:
> When removing hugetlb pages from the pool, we first create a list
> of removed pages and then free those pages back to low level allocators.
> Part of the 'freeing process' is to restore vmemmap for all base pages
> if necessary.  Pass this list of pages to a new routine
> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> performed in bulk.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb.c         |  3 +++
>   mm/hugetlb_vmemmap.c | 13 +++++++++++++
>   mm/hugetlb_vmemmap.h |  5 +++++
>   3 files changed, 21 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 554be94b07bd..dd2dbc256172 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>   {
>   	struct folio *folio, *t_folio;
>   
> +	/* First restore vmemmap for all pages on list. */
> +	hugetlb_vmemmap_restore_folios(h, list);
> +
>   	list_for_each_entry_safe(folio, t_folio, list, lru) {
>   		update_and_free_hugetlb_folio(h, folio, false);
>   		cond_resched();
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ac5577d372fe..79de984919ef 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>   	return ret;
>   }
>   
> +/*
>> + * This function will attempt to restore vmemmap for a list of folios.  There
> + * is no guarantee that restoration will be successful for all or any folios.
> + * This is used in bulk operations, and no feedback is given to the caller.
> + */
> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> +{
> +	struct folio *folio;
> +
> +	list_for_each_entry(folio, folio_list, lru)
> +		(void)hugetlb_vmemmap_restore(h, &folio->page);

I am curious about the purpose of "void" here; it seems it is not necessary,
right? We could see so many places where we do not add the void if the caller
does not care about the return value of the callee.

Thanks.
> +}
> +
>   /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
>   static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
>   {
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index 036494e040ca..b4ee945dc1d4 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -12,6 +12,7 @@
>   
>   #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>   int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list);
>   void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>   
> @@ -44,6 +45,10 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
>   	return 0;
>   }
>   
> +static inline void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> +{
> +}
> +
>   static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>   {
>   }


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages
  2023-09-05 21:44 ` [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages Mike Kravetz
@ 2023-09-06  7:38   ` Muchun Song
  2023-09-06 21:38     ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-06  7:38 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton



On 2023/9/6 05:44, Mike Kravetz wrote:
> Now that batching of hugetlb vmemmap optimization processing is possible,
> batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
> hugetlb page, we add them to a list that is freed after the entire batch
> has been processed.
>
> This enhances the ability to return contiguous ranges of memory to the
> low level allocators.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb_vmemmap.c | 60 ++++++++++++++++++++++++++++----------------
>   1 file changed, 38 insertions(+), 22 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 79de984919ef..a715712df831 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -306,18 +306,21 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>    * @end:	end address of the vmemmap virtual address range that we want to
>    *		remap.
>    * @reuse:	reuse address.
> + * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
> + *		responsibility to free pages.
>    *
>    * Return: %0 on success, negative error code otherwise.
>    */
>   static int vmemmap_remap_free(unsigned long start, unsigned long end,
> -			      unsigned long reuse)
> +			      unsigned long reuse,
> +			      struct list_head *vmemmap_pages)
>   {
>   	int ret;
> -	LIST_HEAD(vmemmap_pages);
> +	LIST_HEAD(freed_pages);

IIUC, we could use the @vmemmap_pages parameter directly instead of
a temporary variable; could the temporary be dropped?
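
Something like this, just to sketch the idea:

	struct vmemmap_remap_walk walk = {
		.remap_pte	= vmemmap_remap_pte,
		.reuse_addr	= reuse,
		.vmemmap_pages	= vmemmap_pages,	/* deposit straight to the caller's list */
	};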

>   	struct vmemmap_remap_walk walk = {
>   		.remap_pte	= vmemmap_remap_pte,
>   		.reuse_addr	= reuse,
> -		.vmemmap_pages	= &vmemmap_pages,
> +		.vmemmap_pages	= &freed_pages,
>   	};
>   	int nid = page_to_nid((struct page *)start);
>   	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> @@ -335,7 +338,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>   	if (walk.reuse_page) {
>   		copy_page(page_to_virt(walk.reuse_page),
>   			  (void *)walk.reuse_addr);
> -		list_add(&walk.reuse_page->lru, &vmemmap_pages);
> +		list_add(&walk.reuse_page->lru, &freed_pages);
>   	}
>   
>   	/*
> @@ -366,15 +369,14 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>   		walk = (struct vmemmap_remap_walk) {
>   			.remap_pte	= vmemmap_restore_pte,
>   			.reuse_addr	= reuse,
> -			.vmemmap_pages	= &vmemmap_pages,
> +			.vmemmap_pages	= &freed_pages,
>   		};
>   
>   		vmemmap_remap_range(reuse, end, &walk);
>   	}
>   	mmap_read_unlock(&init_mm);
>   
> -	free_vmemmap_page_list(&vmemmap_pages);
> -
> +	list_splice(&freed_pages, vmemmap_pages);
>   	return ret;
>   }
>   
> @@ -553,17 +555,9 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
>   	return true;
>   }
>   
> -/**
> - * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> - * @h:		struct hstate.
> - * @head:	the head page whose vmemmap pages will be optimized.
> - *
> - * This function only tries to optimize @head's vmemmap pages and does not
> - * guarantee that the optimization will succeed after it returns. The caller
> - * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> - * have been optimized.
> - */
> -void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> +static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> +					struct page *head,
> +					struct list_head *vmemmap_pages)
>   {
>   	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>   	unsigned long vmemmap_reuse;
> @@ -580,21 +574,43 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>   
>   	/*
>   	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> -	 * to the page which @vmemmap_reuse is mapped to, then free the pages
> -	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> +	 * to the page which @vmemmap_reuse is mapped to.  Add pages previously
> +	 * mapping the range to vmemmap_pages list so that they can be freed by
> +	 * the caller.
>   	 */
> -	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
> +	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages))
>   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
>   	else
>   		SetHPageVmemmapOptimized(head);
>   }
>   
> +/**
> + * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> + * @h:		struct hstate.
> + * @head:	the head page whose vmemmap pages will be optimized.
> + *
> + * This function only tries to optimize @head's vmemmap pages and does not
> + * guarantee that the optimization will succeed after it returns. The caller
> + * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> + * have been optimized.
> + */
> +void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> +{
> +	LIST_HEAD(vmemmap_pages);
> +
> +	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
> +	free_vmemmap_page_list(&vmemmap_pages);
> +}
> +
>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>   {
>   	struct folio *folio;
> +	LIST_HEAD(vmemmap_pages);
>   
>   	list_for_each_entry(folio, folio_list, lru)
> -		hugetlb_vmemmap_optimize(h, &folio->page);
> +		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
> +
> +	free_vmemmap_page_list(&vmemmap_pages);
>   }
>   
>   static struct ctl_table hugetlb_vmemmap_sysctls[] = {


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-06  7:33   ` Muchun Song
@ 2023-09-06  8:07     ` Muchun Song
  2023-09-06 21:12       ` Mike Kravetz
  2023-09-06 20:53     ` Mike Kravetz
  1 sibling, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-06  8:07 UTC (permalink / raw)
  To: Mike Kravetz, Linux-MM, LKML
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@linux.dev> wrote:
> 
> 
> 
> On 2023/9/6 05:44, Mike Kravetz wrote:
>> When removing hugetlb pages from the pool, we first create a list
>> of removed pages and then free those pages back to low level allocators.
>> Part of the 'freeing process' is to restore vmemmap for all base pages
>> if necessary.  Pass this list of pages to a new routine
>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
>> performed in bulk.
>> 
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/hugetlb.c         |  3 +++
>>  mm/hugetlb_vmemmap.c | 13 +++++++++++++
>>  mm/hugetlb_vmemmap.h |  5 +++++
>>  3 files changed, 21 insertions(+)
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 554be94b07bd..dd2dbc256172 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>  {
>>   struct folio *folio, *t_folio;
>>  + /* First restore vmemmap for all pages on list. */
>> + hugetlb_vmemmap_restore_folios(h, list);
>> +
>>   list_for_each_entry_safe(folio, t_folio, list, lru) {
>>   update_and_free_hugetlb_folio(h, folio, false);
>>   cond_resched();
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ac5577d372fe..79de984919ef 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>   return ret;
>>  }
>>  +/*
>> + * This function will attempt to restore vmemmap for a list of folios.  There
>> + * is no guarantee that restoration will be successful for all or any folios.
>> + * This is used in bulk operations, and no feedback is given to the caller.
>> + */
>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
>> +{
>> + struct folio *folio;
>> +
>> + list_for_each_entry(folio, folio_list, lru)
>> + (void)hugetlb_vmemmap_restore(h, &folio->page);
> 
> I am curious about the purpose of "void" here; it seems it is not necessary,
> right? We could see so many places where we do not add the void if the caller
> does not care about the return value of the callee.

Another question: should we stop restoring vmemmap pages when
hugetlb_vmemmap_restore() fails? In that case the system is probably
out of memory, so there is no need to continue, right?
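
Something as simple as the following sketch should do it, I think:

	list_for_each_entry(folio, folio_list, lru) {
		/* stop on the first failure, which is most likely -ENOMEM */
		if (hugetlb_vmemmap_restore(h, &folio->page))
			break;
	}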

> 
> Thanks.
>> +}
>> +
>>  /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
>>  static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
>>  {
>> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
>> index 036494e040ca..b4ee945dc1d4 100644
>> --- a/mm/hugetlb_vmemmap.h
>> +++ b/mm/hugetlb_vmemmap.h
>> @@ -12,6 +12,7 @@
>>    #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>  int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list);
>>  void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
>>  void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>>  @@ -44,6 +45,10 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
>>   return 0;
>>  }
>>  +static inline void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
>> +{
>> +}
>> +
>>  static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>  {
>>  }
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-05 21:44 ` [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
@ 2023-09-06  8:24   ` Muchun Song
  2023-09-06  9:11     ` [External] " Muchun Song
  2023-09-06  9:13     ` Joao Martins
  0 siblings, 2 replies; 44+ messages in thread
From: Muchun Song @ 2023-09-06  8:24 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



On 2023/9/6 05:44, Mike Kravetz wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
>
> In an effort to minimize amount of TLB flushes, batch all PMD splits
> belonging to a range of pages in order to perform only 1 (global) TLB
> flush.
>
> Rebased and updated by Mike Kravetz
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index a715712df831..d956551699bc 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>   	struct list_head	*vmemmap_pages;
>   };
>   
> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>   {
>   	pmd_t __pmd;
>   	int i;
> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>   		/* Make pte visible before pmd. See comment in pmd_install(). */
>   		smp_wmb();
>   		pmd_populate_kernel(&init_mm, pmd, pgtable);
> -		flush_tlb_kernel_range(start, start + PMD_SIZE);
> +		if (flush)
> +			flush_tlb_kernel_range(start, start + PMD_SIZE);
>   	} else {
>   		pte_free_kernel(&init_mm, pgtable);
>   	}
> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>   	do {
>   		int ret;
>   
> -		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> +		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
> +				walk->remap_pte != NULL);

It is better to only make @walk->remap_pte indicate whether we should go
to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
to indicate whether we should flush the TLB at pmd level. It'll be more 
clear.
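
I.e. something like the sketch below (assuming the flags field introduced in
the later patch is also available to the split path):

	ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
				     !(walk->flags & VMEMMAP_NO_TLB_FLUSH));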

>   		if (ret)
>   			return ret;
>   
>   		next = pmd_addr_end(addr, end);
> +
> +		/*
> +		 * We are only splitting, not remapping the hugetlb vmemmap
> +		 * pages.
> +		 */
> +		if (!walk->remap_pte)
> +			continue;
> +
>   		vmemmap_pte_range(pmd, addr, next, walk);
>   	} while (pmd++, addr = next, addr != end);
>   
> @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>   			return ret;
>   	} while (pgd++, addr = next, addr != end);
>   
> -	flush_tlb_kernel_range(start, end);
> +	if (walk->remap_pte)
> +		flush_tlb_kernel_range(start, end);
>   
>   	return 0;
>   }
> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>   	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>   }
>   
> +/**
> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> + *                      backing PMDs of the directmap into PTEs
> + * @start:     start address of the vmemmap virtual address range that we want
> + *             to remap.
> + * @end:       end address of the vmemmap virtual address range that we want to
> + *             remap.
> + * @reuse:     reuse address.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> +				unsigned long reuse)
> +{
> +	int ret;
> +	struct vmemmap_remap_walk walk = {
> +		.remap_pte	= NULL,
> +	};
> +
> +	/* See the comment in the vmemmap_remap_free(). */
> +	BUG_ON(start - reuse != PAGE_SIZE);
> +
> +	mmap_read_lock(&init_mm);
> +	ret = vmemmap_remap_range(reuse, end, &walk);
> +	mmap_read_unlock(&init_mm);
> +
> +	return ret;
> +}
> +
>   /**
>    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>    *			to the page which @reuse is mapped to, then free vmemmap
> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>   	free_vmemmap_page_list(&vmemmap_pages);
>   }
>   
> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> +{
> +	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> +	unsigned long vmemmap_reuse;
> +
> +	if (!vmemmap_should_optimize(h, head))
> +		return;
> +
> +	vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
> +	vmemmap_reuse   = vmemmap_start;
> +	vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
> +
> +	/*
> +	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
> +	 * @vmemmap_end]
> +	 */
> +	vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> +}
> +
>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>   {
>   	struct folio *folio;
>   	LIST_HEAD(vmemmap_pages);
>   
> +	list_for_each_entry(folio, folio_list, lru)
> +		hugetlb_vmemmap_split(h, &folio->page);

Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
to indicate whether it has done successfully, if it fails, it must be
OOM, in which case, there is no sense to continue to split the page table
and optimize the vmemmap pages subsequently, right?
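
E.g. (a sketch only, assuming hugetlb_vmemmap_split() propagates the error
from vmemmap_remap_split()):

	list_for_each_entry(folio, folio_list, lru) {
		/* no point continuing the split pass once we hit -ENOMEM */
		if (hugetlb_vmemmap_split(h, &folio->page))
			break;
	}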

Thanks.

> +
> +	flush_tlb_all();
> +
>   	list_for_each_entry(folio, folio_list, lru)
>   		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
>   


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] Re: [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-06  8:24   ` Muchun Song
@ 2023-09-06  9:11     ` Muchun Song
  2023-09-06  9:26       ` Joao Martins
  2023-09-06  9:13     ` Joao Martins
  1 sibling, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-06  9:11 UTC (permalink / raw)
  To: Mike Kravetz, Joao Martins
  Cc: linux-mm, linux-kernel, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
	muchun.song

On Wed, Sep 6, 2023 at 4:25 PM Muchun Song <muchun.song@linux.dev> wrote:
>
>
>
> On 2023/9/6 05:44, Mike Kravetz wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> >
> > In an effort to minimize amount of TLB flushes, batch all PMD splits
> > belonging to a range of pages in order to perform only 1 (global) TLB
> > flush.
> >
> > Rebased and updated by Mike Kravetz
> >
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >   mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
> >   1 file changed, 68 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index a715712df831..d956551699bc 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
> >       struct list_head        *vmemmap_pages;
> >   };
> >
> > -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> > +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
> >   {
> >       pmd_t __pmd;
> >       int i;
> > @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> >               /* Make pte visible before pmd. See comment in pmd_install(). */
> >               smp_wmb();
> >               pmd_populate_kernel(&init_mm, pmd, pgtable);
> > -             flush_tlb_kernel_range(start, start + PMD_SIZE);
> > +             if (flush)
> > +                     flush_tlb_kernel_range(start, start + PMD_SIZE);
> >       } else {
> >               pte_free_kernel(&init_mm, pgtable);
> >       }
> > @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> >       do {
> >               int ret;
> >
> > -             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> > +             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
> > +                             walk->remap_pte != NULL);
>
> It is better to only make @walk->remap_pte indicate whether we should go
> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
> to indicate whether we should flush the TLB at pmd level. It'll be more
> clear.
>
> >               if (ret)
> >                       return ret;
> >
> >               next = pmd_addr_end(addr, end);
> > +
> > +             /*
> > +              * We are only splitting, not remapping the hugetlb vmemmap
> > +              * pages.
> > +              */
> > +             if (!walk->remap_pte)
> > +                     continue;
> > +
> >               vmemmap_pte_range(pmd, addr, next, walk);
> >       } while (pmd++, addr = next, addr != end);
> >
> > @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> >                       return ret;
> >       } while (pgd++, addr = next, addr != end);
> >
> > -     flush_tlb_kernel_range(start, end);
> > +     if (walk->remap_pte)
> > +             flush_tlb_kernel_range(start, end);
> >
> >       return 0;
> >   }
> > @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >       set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> >   }
> >
> > +/**
> > + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> > + *                      backing PMDs of the directmap into PTEs
> > + * @start:     start address of the vmemmap virtual address range that we want
> > + *             to remap.
> > + * @end:       end address of the vmemmap virtual address range that we want to
> > + *             remap.
> > + * @reuse:     reuse address.
> > + *
> > + * Return: %0 on success, negative error code otherwise.
> > + */
> > +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> > +                             unsigned long reuse)
> > +{
> > +     int ret;
> > +     struct vmemmap_remap_walk walk = {
> > +             .remap_pte      = NULL,
> > +     };
> > +
> > +     /* See the comment in the vmemmap_remap_free(). */
> > +     BUG_ON(start - reuse != PAGE_SIZE);
> > +
> > +     mmap_read_lock(&init_mm);
> > +     ret = vmemmap_remap_range(reuse, end, &walk);
> > +     mmap_read_unlock(&init_mm);
> > +
> > +     return ret;
> > +}
> > +
> >   /**
> >    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> >    *                  to the page which @reuse is mapped to, then free vmemmap
> > @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >       free_vmemmap_page_list(&vmemmap_pages);
> >   }
> >
> > +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> > +{
> > +     unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> > +     unsigned long vmemmap_reuse;
> > +
> > +     if (!vmemmap_should_optimize(h, head))
> > +             return;
> > +
> > +     vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
> > +     vmemmap_reuse   = vmemmap_start;
> > +     vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
> > +
> > +     /*
> > +      * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
> > +      * @vmemmap_end]
> > +      */
> > +     vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> > +}
> > +
> >   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> >   {
> >       struct folio *folio;
> >       LIST_HEAD(vmemmap_pages);
> >
> > +     list_for_each_entry(folio, folio_list, lru)
> > +             hugetlb_vmemmap_split(h, &folio->page);
>
> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
> to indicate whether it has done successfully, if it fails, it must be
> OOM, in which case, there is no sense to continue to split the page table
> and optimize the vmemmap pages subsequently, right?

Sorry, it is reasonable to continue optimizing the vmemmap pages
subsequently, since that should succeed given those vmemmap pages
have already been split successfully.

It seems we should continue to optimize vmemmap even when hugetlb_vmemmap_split()
fails; then we will have more memory with which to continue splitting. But it will
make hugetlb_vmemmap_optimize_folios() a little complex. I'd like to
hear you guys' opinions here.

Thanks.

>
> Thanks.
>
> > +
> > +     flush_tlb_all();
> > +
> >       list_for_each_entry(folio, folio_list, lru)
> >               __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
> >
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-06  8:24   ` Muchun Song
  2023-09-06  9:11     ` [External] " Muchun Song
@ 2023-09-06  9:13     ` Joao Martins
  1 sibling, 0 replies; 44+ messages in thread
From: Joao Martins @ 2023-09-06  9:13 UTC (permalink / raw)
  To: Muchun Song, Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
	David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
	Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton

On 06/09/2023 09:24, Muchun Song wrote:
> On 2023/9/6 05:44, Mike Kravetz wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>> belonging to a range of pages in order to perform only 1 (global) TLB
>> flush.
>>
>> Rebased and updated by Mike Kravetz
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 68 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index a715712df831..d956551699bc 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>>       struct list_head    *vmemmap_pages;
>>   };
>>   -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>>   {
>>       pmd_t __pmd;
>>       int i;
>> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long
>> start)
>>           /* Make pte visible before pmd. See comment in pmd_install(). */
>>           smp_wmb();
>>           pmd_populate_kernel(&init_mm, pmd, pgtable);
>> -        flush_tlb_kernel_range(start, start + PMD_SIZE);
>> +        if (flush)
>> +            flush_tlb_kernel_range(start, start + PMD_SIZE);
>>       } else {
>>           pte_free_kernel(&init_mm, pgtable);
>>       }
>> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long
>> addr,
>>       do {
>>           int ret;
>>   -        ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>> +        ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>> +                walk->remap_pte != NULL);
> 
> It is better to only make @walk->remap_pte indicate whether we should go
> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
> to indicate whether we should flush the TLB at pmd level. It'll be more clear.
> 
Part of the reason I did this was to differentiate an explicit split()
from a split() occurring during a remap of a page. So we would batch the flush
on split, while flushing on each PMD on a remap. But OK, maybe this doesn't
matter much if we end up returning earlier down below as you suggest.

>>           if (ret)
>>               return ret;
>>             next = pmd_addr_end(addr, end);
>> +
>> +        /*
>> +         * We are only splitting, not remapping the hugetlb vmemmap
>> +         * pages.
>> +         */
>> +        if (!walk->remap_pte)
>> +            continue;
>> +
>>           vmemmap_pte_range(pmd, addr, next, walk);
>>       } while (pmd++, addr = next, addr != end);
>>   @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start,
>> unsigned long end,
>>               return ret;
>>       } while (pgd++, addr = next, addr != end);
>>   -    flush_tlb_kernel_range(start, end);
>> +    if (walk->remap_pte)
>> +        flush_tlb_kernel_range(start, end);
>>         return 0;
>>   }
>> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long
>> addr,
>>       set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>   }
>>   +/**
>> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
>> + *                      backing PMDs of the directmap into PTEs
>> + * @start:     start address of the vmemmap virtual address range that we want
>> + *             to remap.
>> + * @end:       end address of the vmemmap virtual address range that we want to
>> + *             remap.
>> + * @reuse:     reuse address.
>> + *
>> + * Return: %0 on success, negative error code otherwise.
>> + */
>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>> +                unsigned long reuse)
>> +{
>> +    int ret;
>> +    struct vmemmap_remap_walk walk = {
>> +        .remap_pte    = NULL,
>> +    };
>> +
>> +    /* See the comment in the vmemmap_remap_free(). */
>> +    BUG_ON(start - reuse != PAGE_SIZE);
>> +
>> +    mmap_read_lock(&init_mm);
>> +    ret = vmemmap_remap_range(reuse, end, &walk);
>> +    mmap_read_unlock(&init_mm);
>> +
>> +    return ret;
>> +}
>> +
>>   /**
>>    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>    *            to the page which @reuse is mapped to, then free vmemmap
>> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h,
>> struct page *head)
>>       free_vmemmap_page_list(&vmemmap_pages);
>>   }
>>   +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>> +{
>> +    unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>> +    unsigned long vmemmap_reuse;
>> +
>> +    if (!vmemmap_should_optimize(h, head))
>> +        return;
>> +
>> +    vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
>> +    vmemmap_reuse   = vmemmap_start;
>> +    vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
>> +
>> +    /*
>> +     * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>> +     * @vmemmap_end]
>> +     */
>> +    vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>> +}
>> +
>>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>> *folio_list)
>>   {
>>       struct folio *folio;
>>       LIST_HEAD(vmemmap_pages);
>>   +    list_for_each_entry(folio, folio_list, lru)
>> +        hugetlb_vmemmap_split(h, &folio->page);
> 
> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
> to indicate whether it has done successfully, if it fails, it must be
>> OOM, in which case, there is no sense to continue to split the page table
> and optimize the vmemmap pages subsequently, right?
> 
I suppose that makes sense. hugetlb_vmemmap_split() already returns the error;
it's just a matter of testing it and breaking out of the loop into flush_tlb_all().

> Thanks.
> 
>> +
>> +    flush_tlb_all();
>> +
>>       list_for_each_entry(folio, folio_list, lru)
>>           __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
>>   
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] Re: [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-06  9:11     ` [External] " Muchun Song
@ 2023-09-06  9:26       ` Joao Martins
  2023-09-06  9:32         ` [External] " Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2023-09-06  9:26 UTC (permalink / raw)
  To: Muchun Song, Mike Kravetz
  Cc: linux-mm, linux-kernel, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
	muchun.song



On 06/09/2023 10:11, Muchun Song wrote:
> On Wed, Sep 6, 2023 at 4:25 PM Muchun Song <muchun.song@linux.dev> wrote:
>>
>>
>>
>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>
>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>> flush.
>>>
>>> Rebased and updated by Mike Kravetz
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>   mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>>>   1 file changed, 68 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index a715712df831..d956551699bc 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>>>       struct list_head        *vmemmap_pages;
>>>   };
>>>
>>> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>>>   {
>>>       pmd_t __pmd;
>>>       int i;
>>> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>               /* Make pte visible before pmd. See comment in pmd_install(). */
>>>               smp_wmb();
>>>               pmd_populate_kernel(&init_mm, pmd, pgtable);
>>> -             flush_tlb_kernel_range(start, start + PMD_SIZE);
>>> +             if (flush)
>>> +                     flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>       } else {
>>>               pte_free_kernel(&init_mm, pgtable);
>>>       }
>>> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>>       do {
>>>               int ret;
>>>
>>> -             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>>> +             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>> +                             walk->remap_pte != NULL);
>>
>> It is better to only make @walk->remap_pte indicate whether we should go
>> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
>> to indicate whether we should flush the TLB at pmd level. It'll be more
>> clear.
>>
>>>               if (ret)
>>>                       return ret;
>>>
>>>               next = pmd_addr_end(addr, end);
>>> +
>>> +             /*
>>> +              * We are only splitting, not remapping the hugetlb vmemmap
>>> +              * pages.
>>> +              */
>>> +             if (!walk->remap_pte)
>>> +                     continue;
>>> +
>>>               vmemmap_pte_range(pmd, addr, next, walk);
>>>       } while (pmd++, addr = next, addr != end);
>>>
>>> @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>                       return ret;
>>>       } while (pgd++, addr = next, addr != end);
>>>
>>> -     flush_tlb_kernel_range(start, end);
>>> +     if (walk->remap_pte)
>>> +             flush_tlb_kernel_range(start, end);
>>>
>>>       return 0;
>>>   }
>>> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>       set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>>   }
>>>
>>> +/**
>>> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
>>> + *                      backing PMDs of the directmap into PTEs
>>> + * @start:     start address of the vmemmap virtual address range that we want
>>> + *             to remap.
>>> + * @end:       end address of the vmemmap virtual address range that we want to
>>> + *             remap.
>>> + * @reuse:     reuse address.
>>> + *
>>> + * Return: %0 on success, negative error code otherwise.
>>> + */
>>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>> +                             unsigned long reuse)
>>> +{
>>> +     int ret;
>>> +     struct vmemmap_remap_walk walk = {
>>> +             .remap_pte      = NULL,
>>> +     };
>>> +
>>> +     /* See the comment in the vmemmap_remap_free(). */
>>> +     BUG_ON(start - reuse != PAGE_SIZE);
>>> +
>>> +     mmap_read_lock(&init_mm);
>>> +     ret = vmemmap_remap_range(reuse, end, &walk);
>>> +     mmap_read_unlock(&init_mm);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>>   /**
>>>    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>>    *                  to the page which @reuse is mapped to, then free vmemmap
>>> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>>       free_vmemmap_page_list(&vmemmap_pages);
>>>   }
>>>
>>> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>> +{
>>> +     unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>> +     unsigned long vmemmap_reuse;
>>> +
>>> +     if (!vmemmap_should_optimize(h, head))
>>> +             return;
>>> +
>>> +     vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
>>> +     vmemmap_reuse   = vmemmap_start;
>>> +     vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>> +
>>> +     /*
>>> +      * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>> +      * @vmemmap_end]
>>> +      */
>>> +     vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>> +}
>>> +
>>>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>>>   {
>>>       struct folio *folio;
>>>       LIST_HEAD(vmemmap_pages);
>>>
>>> +     list_for_each_entry(folio, folio_list, lru)
>>> +             hugetlb_vmemmap_split(h, &folio->page);
>>
>> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
>> to indicate whether it has done successfully, if it fails, it must be
>> OOM, in which case, there is no sense to continue to split the page table
>> and optimize the vmemmap pages subsequently, right?
> 
> Sorry, it is reasonable to continue to optimize the vmemmap pages
> subsequently since it should succeed because those vmemmap pages
> have been split successfully previously.
> 
> Seems we should continue to optimize vmemmap once hugetlb_vmemmap_split()
> fails, then we will have more memory to continue to split. 

Good point

> But it will
> make hugetlb_vmemmap_optimize_folios() a little complex. I'd like to
> hear you guys' opinions here.
> 
I think it won't add that much complexity if we don't optimize too much of the
slowpath (when we are out of memory). In the batch freeing patch we could
additionally test the return value of __hugetlb_vmemmap_optimize() for ENOMEM
and free the currently stored vmemmap_pages (if any), and keep iterating the
optimize loop. Should be simple enough and make this a bit more resilient to
that scenario. But we would need to keep the earlier check you commented above
(where we use @remap_pte to defer PMD flush).
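
Roughly something like this (just a sketch, assuming __hugetlb_vmemmap_optimize()
is changed to return an error code):

	list_for_each_entry(folio, folio_list, lru) {
		int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
						     &vmemmap_pages,
						     VMEMMAP_NO_TLB_FLUSH);

		/*
		 * Remapped-but-unflushed pages must be flushed before they
		 * are handed back; then keep iterating, as the freed vmemmap
		 * may let later folios succeed.
		 */
		if (ret == -ENOMEM) {
			flush_tlb_all();
			free_vmemmap_page_list(&vmemmap_pages);
		}
	}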

> Thanks.
> 
>>
>> Thanks.
>>
>>> +
>>> +     flush_tlb_all();
>>> +
>>>       list_for_each_entry(folio, folio_list, lru)
>>>               __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
>>>
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-06  9:26       ` Joao Martins
@ 2023-09-06  9:32         ` Muchun Song
  2023-09-06  9:44           ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-06  9:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: Muchun Song, Mike Kravetz, Linux-MM, LKML, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



> On Sep 6, 2023, at 17:26, Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> 
> 
> On 06/09/2023 10:11, Muchun Song wrote:
>> On Wed, Sep 6, 2023 at 4:25 PM Muchun Song <muchun.song@linux.dev> wrote:
>>> 
>>> 
>>> 
>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> 
>>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>>> flush.
>>>> 
>>>> Rebased and updated by Mike Kravetz
>>>> 
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> ---
>>>>  mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 68 insertions(+), 4 deletions(-)
>>>> 
>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>> index a715712df831..d956551699bc 100644
>>>> --- a/mm/hugetlb_vmemmap.c
>>>> +++ b/mm/hugetlb_vmemmap.c
>>>> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>>>>      struct list_head        *vmemmap_pages;
>>>>  };
>>>> 
>>>> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>>>>  {
>>>>      pmd_t __pmd;
>>>>      int i;
>>>> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>              /* Make pte visible before pmd. See comment in pmd_install(). */
>>>>              smp_wmb();
>>>>              pmd_populate_kernel(&init_mm, pmd, pgtable);
>>>> -             flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>> +             if (flush)
>>>> +                     flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>      } else {
>>>>              pte_free_kernel(&init_mm, pgtable);
>>>>      }
>>>> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>>>      do {
>>>>              int ret;
>>>> 
>>>> -             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>>>> +             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>>> +                             walk->remap_pte != NULL);
>>> 
>>> It is better to only make @walk->remap_pte indicate whether we should go
>>> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
>>> to indicate whether we should flush the TLB at pmd level. It'll be more
>>> clear.
>>> 
>>>>              if (ret)
>>>>                      return ret;
>>>> 
>>>>              next = pmd_addr_end(addr, end);
>>>> +
>>>> +             /*
>>>> +              * We are only splitting, not remapping the hugetlb vmemmap
>>>> +              * pages.
>>>> +              */
>>>> +             if (!walk->remap_pte)
>>>> +                     continue;
>>>> +
>>>>              vmemmap_pte_range(pmd, addr, next, walk);
>>>>      } while (pmd++, addr = next, addr != end);
>>>> 
>>>> @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>>                      return ret;
>>>>      } while (pgd++, addr = next, addr != end);
>>>> 
>>>> -     flush_tlb_kernel_range(start, end);
>>>> +     if (walk->remap_pte)
>>>> +             flush_tlb_kernel_range(start, end);
>>>> 
>>>>      return 0;
>>>>  }
>>>> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>>      set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>>>  }
>>>> 
>>>> +/**
>>>> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
>>>> + *                      backing PMDs of the directmap into PTEs
>>>> + * @start:     start address of the vmemmap virtual address range that we want
>>>> + *             to remap.
>>>> + * @end:       end address of the vmemmap virtual address range that we want to
>>>> + *             remap.
>>>> + * @reuse:     reuse address.
>>>> + *
>>>> + * Return: %0 on success, negative error code otherwise.
>>>> + */
>>>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>>> +                             unsigned long reuse)
>>>> +{
>>>> +     int ret;
>>>> +     struct vmemmap_remap_walk walk = {
>>>> +             .remap_pte      = NULL,
>>>> +     };
>>>> +
>>>> +     /* See the comment in the vmemmap_remap_free(). */
>>>> +     BUG_ON(start - reuse != PAGE_SIZE);
>>>> +
>>>> +     mmap_read_lock(&init_mm);
>>>> +     ret = vmemmap_remap_range(reuse, end, &walk);
>>>> +     mmap_read_unlock(&init_mm);
>>>> +
>>>> +     return ret;
>>>> +}
>>>> +
>>>>  /**
>>>>   * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>>>   *                  to the page which @reuse is mapped to, then free vmemmap
>>>> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>>>      free_vmemmap_page_list(&vmemmap_pages);
>>>>  }
>>>> 
>>>> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>>> +{
>>>> +     unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>>> +     unsigned long vmemmap_reuse;
>>>> +
>>>> +     if (!vmemmap_should_optimize(h, head))
>>>> +             return;
>>>> +
>>>> +     vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
>>>> +     vmemmap_reuse   = vmemmap_start;
>>>> +     vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>> +
>>>> +     /*
>>>> +      * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>>> +      * @vmemmap_end]
>>>> +      */
>>>> +     vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>>> +}
>>>> +
>>>>  void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>>>>  {
>>>>      struct folio *folio;
>>>>      LIST_HEAD(vmemmap_pages);
>>>> 
>>>> +     list_for_each_entry(folio, folio_list, lru)
>>>> +             hugetlb_vmemmap_split(h, &folio->page);
>>> 
>>> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
>>> to indicate whether it has done successfully, if it fails, it must be
>>> OOM, in which case, there is no sense to continue to split the page table
>>> and optimize the vmemmap pages subsequently, right?
>> 
>> Sorry, it is reasonable to continue to optimize the vmemmap pages
>> subsequently since it should succeed because those vmemmap pages
>> have been split successfully previously.
>> 
>> Seems we should continue to optimize vmemmap once hugetlb_vmemmap_split()
>> fails, then we will have more memory to continue to split. 
> 
> Good point
> 
>> But it will
>> make hugetlb_vmemmap_optimize_folios() a little complex. I'd like to
>> hear you guys' opinions here.
>> 
> I think it won't add that much complexity if we don't optimize too much of the
> slowpath (when we are out of memory). In the batch freeing patch we could
> additionally test the return value of __hugetlb_vmemmap_optimize() for ENOMEM
> and free the currently stored vmemmap_pages (if any), and keep iterating the
> optimize loop. Should be simple enough and make this a bit more resilient to
> that scenario.

Yep, we could try this.

> But we would need to keep the earlier check you commented above
> (where we use @remap_pte to defer PMD flush).

I think 2 flags would be suitable here: one is VMEMMAP_REMAP_NO_TLB_FLUSH,
the other is VMEMMAP_SPLIT_NO_TLB_FLUSH.

Thanks.

> 
>> Thanks.
>> 
>>> 
>>> Thanks.
>>> 
>>>> +
>>>> +     flush_tlb_all();
>>>> +
>>>>      list_for_each_entry(folio, folio_list, lru)
>>>>              __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-06  9:32         ` [External] " Muchun Song
@ 2023-09-06  9:44           ` Joao Martins
  2023-09-06 11:34             ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2023-09-06  9:44 UTC (permalink / raw)
  To: Muchun Song
  Cc: Muchun Song, Mike Kravetz, Linux-MM, LKML, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton

On 06/09/2023 10:32, Muchun Song wrote:
>> On Sep 6, 2023, at 17:26, Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 06/09/2023 10:11, Muchun Song wrote:
>>> On Wed, Sep 6, 2023 at 4:25 PM Muchun Song <muchun.song@linux.dev> wrote:
>>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>
>>>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>>>> flush.
>>>>>
>>>>> Rebased and updated by Mike Kravetz
>>>>>
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> ---
>>>>>  mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>>>>>  1 file changed, 68 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>> index a715712df831..d956551699bc 100644
>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>>>>>      struct list_head        *vmemmap_pages;
>>>>>  };
>>>>>
>>>>> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>>>>>  {
>>>>>      pmd_t __pmd;
>>>>>      int i;
>>>>> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>>              /* Make pte visible before pmd. See comment in pmd_install(). */
>>>>>              smp_wmb();
>>>>>              pmd_populate_kernel(&init_mm, pmd, pgtable);
>>>>> -             flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>> +             if (flush)
>>>>> +                     flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>>      } else {
>>>>>              pte_free_kernel(&init_mm, pgtable);
>>>>>      }
>>>>> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>>>>      do {
>>>>>              int ret;
>>>>>
>>>>> -             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>>>>> +             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>>>> +                             walk->remap_pte != NULL);
>>>>
>>>> It is bettter to only make @walk->remap_pte indicate whether we should go
>>>> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
>>>> to indicate whether we should flush the TLB at pmd level. It'll be more
>>>> clear.
>>>>
>>>>>              if (ret)
>>>>>                      return ret;
>>>>>
>>>>>              next = pmd_addr_end(addr, end);
>>>>> +
>>>>> +             /*
>>>>> +              * We are only splitting, not remapping the hugetlb vmemmap
>>>>> +              * pages.
>>>>> +              */
>>>>> +             if (!walk->remap_pte)
>>>>> +                     continue;
>>>>> +
>>>>>              vmemmap_pte_range(pmd, addr, next, walk);
>>>>>      } while (pmd++, addr = next, addr != end);
>>>>>
>>>>> @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>>>                      return ret;
>>>>>      } while (pgd++, addr = next, addr != end);
>>>>>
>>>>> -     flush_tlb_kernel_range(start, end);
>>>>> +     if (walk->remap_pte)
>>>>> +             flush_tlb_kernel_range(start, end);
>>>>>
>>>>>      return 0;
>>>>>  }
>>>>> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>>>      set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>>>>  }
>>>>>
>>>>> +/**
>>>>> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
>>>>> + *                      backing PMDs of the directmap into PTEs
>>>>> + * @start:     start address of the vmemmap virtual address range that we want
>>>>> + *             to remap.
>>>>> + * @end:       end address of the vmemmap virtual address range that we want to
>>>>> + *             remap.
>>>>> + * @reuse:     reuse address.
>>>>> + *
>>>>> + * Return: %0 on success, negative error code otherwise.
>>>>> + */
>>>>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>>>> +                             unsigned long reuse)
>>>>> +{
>>>>> +     int ret;
>>>>> +     struct vmemmap_remap_walk walk = {
>>>>> +             .remap_pte      = NULL,
>>>>> +     };
>>>>> +
>>>>> +     /* See the comment in the vmemmap_remap_free(). */
>>>>> +     BUG_ON(start - reuse != PAGE_SIZE);
>>>>> +
>>>>> +     mmap_read_lock(&init_mm);
>>>>> +     ret = vmemmap_remap_range(reuse, end, &walk);
>>>>> +     mmap_read_unlock(&init_mm);
>>>>> +
>>>>> +     return ret;
>>>>> +}
>>>>> +
>>>>>  /**
>>>>>   * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>>>>   *                  to the page which @reuse is mapped to, then free vmemmap
>>>>> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>>>>      free_vmemmap_page_list(&vmemmap_pages);
>>>>>  }
>>>>>
>>>>> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>>>> +{
>>>>> +     unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>>>> +     unsigned long vmemmap_reuse;
>>>>> +
>>>>> +     if (!vmemmap_should_optimize(h, head))
>>>>> +             return;
>>>>> +
>>>>> +     vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
>>>>> +     vmemmap_reuse   = vmemmap_start;
>>>>> +     vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>>> +
>>>>> +     /*
>>>>> +      * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>>>> +      * @vmemmap_end]
>>>>> +      */
>>>>> +     vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>>>> +}
>>>>> +
>>>>>  void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>>>>>  {
>>>>>      struct folio *folio;
>>>>>      LIST_HEAD(vmemmap_pages);
>>>>>
>>>>> +     list_for_each_entry(folio, folio_list, lru)
>>>>> +             hugetlb_vmemmap_split(h, &folio->page);
>>>>
>>>> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
>>>> to indicate whether it has done successfully, if it fails, it must be
>>>> OOM, in which case, there is no sense to continue to split the page table
>>>> and optimize the vmemmap pages subsequently, right?
>>>
>>> Sorry, it is reasonable to continue to optimize the vmemmap pages
>>> subsequently since it should succeed because those vmemmap pages
>>> have been split successfully previously.
>>>
>>> Seems we should continue to optimize vmemmap once hugetlb_vmemmap_split()
>>> fails, then we will have more memory to continue to split. 
>>
>> Good point
>>
>>> But it will
>>> make hugetlb_vmemmap_optimize_folios() a little complex. I'd like to
>>> hear you guys' opinions here.
>>>
>> I think it won't add that much complexity if we don't optimize too much of the
>> slowpath (when we are out of memory). In the batch freeing patch we could
>> additionally test the return value of __hugetlb_vmemmap_optimize() for ENOMEM
>> and free the currently stored vmemmap_pages (if any), and keep iterating the
>> optimize loop. Should be simple enough and make this a bit more resilient to
>> that scenario.
> 
> Yep, we could try this.
> 
>> But we would need to keep the earlier check you commented above
>> (where we use @remap_pte to defer PMD flush).
> 
> I think 2 flags will suitable for you, one is VMEMMAP_REMAP_NO_TLB_FLUSH,
> another is VMEMMAP_SPLIT_NO_TLB_FLUSH.

This means going back to v1. I thought we agreed to consolidate/simplify
this into one flag, and use @remap_pte to differentiate between split and
remap.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
  2023-09-06  9:44           ` Joao Martins
@ 2023-09-06 11:34             ` Muchun Song
  0 siblings, 0 replies; 44+ messages in thread
From: Muchun Song @ 2023-09-06 11:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: Muchun Song, Mike Kravetz, Linux-MM, LKML, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



> On Sep 6, 2023, at 17:44, Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> On 06/09/2023 10:32, Muchun Song wrote:
>>> On Sep 6, 2023, at 17:26, Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 06/09/2023 10:11, Muchun Song wrote:
>>>> On Wed, Sep 6, 2023 at 4:25 PM Muchun Song <muchun.song@linux.dev> wrote:
>>>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>> 
>>>>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>>>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>>>>> flush.
>>>>>> 
>>>>>> Rebased and updated by Mike Kravetz
>>>>>> 
>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>> ---
>>>>>> mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>>>>>> 1 file changed, 68 insertions(+), 4 deletions(-)
>>>>>> 
>>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>>> index a715712df831..d956551699bc 100644
>>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>>> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>>>>>>     struct list_head        *vmemmap_pages;
>>>>>> };
>>>>>> 
>>>>>> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>>>>>> {
>>>>>>     pmd_t __pmd;
>>>>>>     int i;
>>>>>> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>>>             /* Make pte visible before pmd. See comment in pmd_install(). */
>>>>>>             smp_wmb();
>>>>>>             pmd_populate_kernel(&init_mm, pmd, pgtable);
>>>>>> -             flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>>> +             if (flush)
>>>>>> +                     flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>>>     } else {
>>>>>>             pte_free_kernel(&init_mm, pgtable);
>>>>>>     }
>>>>>> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>>>>>     do {
>>>>>>             int ret;
>>>>>> 
>>>>>> -             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>>>>>> +             ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>>>>> +                             walk->remap_pte != NULL);
>>>>> 
>>>>> It is bettter to only make @walk->remap_pte indicate whether we should go
>>>>> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
>>>>> to indicate whether we should flush the TLB at pmd level. It'll be more
>>>>> clear.
>>>>> 
>>>>>>             if (ret)
>>>>>>                     return ret;
>>>>>> 
>>>>>>             next = pmd_addr_end(addr, end);
>>>>>> +
>>>>>> +             /*
>>>>>> +              * We are only splitting, not remapping the hugetlb vmemmap
>>>>>> +              * pages.
>>>>>> +              */
>>>>>> +             if (!walk->remap_pte)
>>>>>> +                     continue;
>>>>>> +
>>>>>>             vmemmap_pte_range(pmd, addr, next, walk);
>>>>>>     } while (pmd++, addr = next, addr != end);
>>>>>> 
>>>>>> @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>>>>                     return ret;
>>>>>>     } while (pgd++, addr = next, addr != end);
>>>>>> 
>>>>>> -     flush_tlb_kernel_range(start, end);
>>>>>> +     if (walk->remap_pte)
>>>>>> +             flush_tlb_kernel_range(start, end);
>>>>>> 
>>>>>>     return 0;
>>>>>> }
>>>>>> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>>>>     set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>>>>> }
>>>>>> 
>>>>>> +/**
>>>>>> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
>>>>>> + *                      backing PMDs of the directmap into PTEs
>>>>>> + * @start:     start address of the vmemmap virtual address range that we want
>>>>>> + *             to remap.
>>>>>> + * @end:       end address of the vmemmap virtual address range that we want to
>>>>>> + *             remap.
>>>>>> + * @reuse:     reuse address.
>>>>>> + *
>>>>>> + * Return: %0 on success, negative error code otherwise.
>>>>>> + */
>>>>>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>>>>> +                             unsigned long reuse)
>>>>>> +{
>>>>>> +     int ret;
>>>>>> +     struct vmemmap_remap_walk walk = {
>>>>>> +             .remap_pte      = NULL,
>>>>>> +     };
>>>>>> +
>>>>>> +     /* See the comment in the vmemmap_remap_free(). */
>>>>>> +     BUG_ON(start - reuse != PAGE_SIZE);
>>>>>> +
>>>>>> +     mmap_read_lock(&init_mm);
>>>>>> +     ret = vmemmap_remap_range(reuse, end, &walk);
>>>>>> +     mmap_read_unlock(&init_mm);
>>>>>> +
>>>>>> +     return ret;
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>>  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>>>>>  *                  to the page which @reuse is mapped to, then free vmemmap
>>>>>> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>>>>>     free_vmemmap_page_list(&vmemmap_pages);
>>>>>> }
>>>>>> 
>>>>>> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>>>>> +{
>>>>>> +     unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>>>>> +     unsigned long vmemmap_reuse;
>>>>>> +
>>>>>> +     if (!vmemmap_should_optimize(h, head))
>>>>>> +             return;
>>>>>> +
>>>>>> +     vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
>>>>>> +     vmemmap_reuse   = vmemmap_start;
>>>>>> +     vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>>>> +
>>>>>> +     /*
>>>>>> +      * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>>>>> +      * @vmemmap_end]
>>>>>> +      */
>>>>>> +     vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>>>>> +}
>>>>>> +
>>>>>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>>>>>> {
>>>>>>     struct folio *folio;
>>>>>>     LIST_HEAD(vmemmap_pages);
>>>>>> 
>>>>>> +     list_for_each_entry(folio, folio_list, lru)
>>>>>> +             hugetlb_vmemmap_split(h, &folio->page);
>>>>> 
>>>>> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
>>>>> to indicate whether it has done successfully, if it fails, it must be
>>>>> OOM, in which case, there is no sense to continue to split the page table
>>>>> and optimize the vmemmap pages subsequently, right?
>>>> 
>>>> Sorry, it is reasonable to continue to optimize the vmemmap pages
>>>> subsequently since it should succeed because those vmemmap pages
>>>> have been split successfully previously.
>>>> 
>>>> Seems we should continue to optimize vmemmap once hugetlb_vmemmap_split()
>>>> fails, then we will have more memory to continue to split. 
>>> 
>>> Good point
>>> 
>>>> But it will
>>>> make hugetlb_vmemmap_optimize_folios() a little complex. I'd like to
>>>> hear you guys' opinions here.
>>>> 
>>> I think it won't add that much complexity if we don't optimize too much of the
>>> slowpath (when we are out of memory). In the batch freeing patch we could
>>> additionally test the return value of __hugetlb_vmemmap_optimize() for ENOMEM
>>> and free the currently stored vmemmap_pages (if any), and keep iterating the
>>> optimize loop. Should be simple enough and make this a bit more resilient to
>>> that scenario.
>> 
>> Yep, we could try this.
>> 
>>> But we would need to keep the earlier check you commented above
>>> (where we use @remap_pte to defer PMD flush).
>> 
>> I think 2 flags will suitable for you, one is VMEMMAP_REMAP_NO_TLB_FLUSH,
>> another is VMEMMAP_SPLIT_NO_TLB_FLUSH.
> 
> This means going back to the v1. I thought we agreed to consolidate/simplify
> into one flag, and use @remap_pte to differentiate between split and remap.

But it is a little different: we use @remap_pte to indicate whether we
should go to the last level (i.e. do the remap), while the flag is used to
indicate whether we should flush the TLB when splitting is necessary (note
that remap also needs a split). It means the split and the TLB-flush
suppression are not bound together. Sorry, I just want to keep the
semantics clear.
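
E.g., roughly like this, hand written only to illustrate (it assumes the
walk->flags member and the VMEMMAP_SPLIT_NO_TLB_FLUSH name suggested above):

 	do {
 		int ret;
 
-		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
-				walk->remap_pte != NULL);
+		/* The flag only controls the TLB flush at the PMD level. */
+		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
+				!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
 		if (ret)
 			return ret;
 
 		next = pmd_addr_end(addr, end);
 
+		/* @remap_pte only decides whether we go to the PTE level. */
 		if (!walk->remap_pte)
 			continue;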

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-06  7:33   ` Muchun Song
  2023-09-06  8:07     ` Muchun Song
@ 2023-09-06 20:53     ` Mike Kravetz
  1 sibling, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-06 20:53 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Naoya Horiguchi, Barry Song, Michal Hocko,
	Matthew Wilcox, Xiongchun Duan, Andrew Morton

On 09/06/23 15:33, Muchun Song wrote:
> 
> 
> On 2023/9/6 05:44, Mike Kravetz wrote:
> > When removing hugetlb pages from the pool, we first create a list
> > of removed pages and then free those pages back to low level allocators.
> > Part of the 'freeing process' is to restore vmemmap for all base pages
> > if necessary.  Pass this list of pages to a new routine
> > hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> > performed in bulk.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >   mm/hugetlb.c         |  3 +++
> >   mm/hugetlb_vmemmap.c | 13 +++++++++++++
> >   mm/hugetlb_vmemmap.h |  5 +++++
> >   3 files changed, 21 insertions(+)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 554be94b07bd..dd2dbc256172 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> >   {
> >   	struct folio *folio, *t_folio;
> > +	/* First restore vmemmap for all pages on list. */
> > +	hugetlb_vmemmap_restore_folios(h, list);
> > +
> >   	list_for_each_entry_safe(folio, t_folio, list, lru) {
> >   		update_and_free_hugetlb_folio(h, folio, false);
> >   		cond_resched();
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index ac5577d372fe..79de984919ef 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> >   	return ret;
> >   }
> > +/*
> > + * This function will attempt to resore vmemmap for a list of folios.  There
> > + * is no guarantee that restoration will be successful for all or any folios.
> > + * This is used in bulk operations, and no feedback is given to the caller.
> > + */
> > +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> > +{
> > +	struct folio *folio;
> > +
> > +	list_for_each_entry(folio, folio_list, lru)
> > +		(void)hugetlb_vmemmap_restore(h, &folio->page);
> 
> I am curious about the purpose of "void" here, seems it it not necessnary,
> ritgh? We cound see so many palces where we do not add the void if the
> caller
> does not care about the return value of the callee.

Yes, we can drop the (void).  Sorry about adding that.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-06  8:07     ` Muchun Song
@ 2023-09-06 21:12       ` Mike Kravetz
  2023-09-07  3:33         ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-06 21:12 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 09/06/23 16:07, Muchun Song wrote:
> > On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@linux.dev> wrote:
> > On 2023/9/6 05:44, Mike Kravetz wrote:
> >> When removing hugetlb pages from the pool, we first create a list
> >> of removed pages and then free those pages back to low level allocators.
> >> Part of the 'freeing process' is to restore vmemmap for all base pages
> >> if necessary.  Pass this list of pages to a new routine
> >> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> >> performed in bulk.
> >> 
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> ---
> >>  mm/hugetlb.c         |  3 +++
> >>  mm/hugetlb_vmemmap.c | 13 +++++++++++++
> >>  mm/hugetlb_vmemmap.h |  5 +++++
> >>  3 files changed, 21 insertions(+)
> >> 
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 554be94b07bd..dd2dbc256172 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> >>  {
> >>   struct folio *folio, *t_folio;
> >>  + /* First restore vmemmap for all pages on list. */
> >> + hugetlb_vmemmap_restore_folios(h, list);
> >> +
> >>   list_for_each_entry_safe(folio, t_folio, list, lru) {
> >>   update_and_free_hugetlb_folio(h, folio, false);
> >>   cond_resched();
> >> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >> index ac5577d372fe..79de984919ef 100644
> >> --- a/mm/hugetlb_vmemmap.c
> >> +++ b/mm/hugetlb_vmemmap.c
> >> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> >>   return ret;
> >>  }
> >>  +/*
> >> + * This function will attempt to resore vmemmap for a list of folios.  There
> >> + * is no guarantee that restoration will be successful for all or any folios.
> >> + * This is used in bulk operations, and no feedback is given to the caller.
> >> + */
> >> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> >> +{
> >> + struct folio *folio;
> >> +
> >> + list_for_each_entry(folio, folio_list, lru)
> >> + (void)hugetlb_vmemmap_restore(h, &folio->page);
> > 
> > I am curious about the purpose of "void" here, seems it it not necessnary,
> > ritgh? We cound see so many palces where we do not add the void if the caller
> > does not care about the return value of the callee.
> 
> Another question: should we stop restoring vmemmap pages when
> hugetlb_vmemmap_restore() fails? In which case, I suspect there
> is no memory probably, there is no need to continue, right?

Recall that the list of hugetlb pages may be from multiple nodes.  My first
thought was that we should continue because memory allocation may fail on one
node but succeed on another.  However, with
https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@huawei.com/
memory allocation should fall back to other nodes.  So, yes, I do believe it
would make sense to stop when hugetlb_vmemmap_restore() returns ENOMEM, as
we are unlikely to make forward progress.

Today's behavior will try to restore vmemmap for all pages.  No stopping
on error.

I have mixed thoughts on this.  Quitting on error 'seems reasonable'.
However, if we continue we 'might' be able to allocate vmemmap for one
hugetlb page.  And, if we free that one hugetlb page, it should provide
enough memory to restore the vmemmap for several more, so we may still be
able to free most pages on the list.
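
If we do go with stopping on ENOMEM, it could be as simple as the following
(untested):

void hugetlb_vmemmap_restore_folios(const struct hstate *h,
				    struct list_head *folio_list)
{
	struct folio *folio;

	list_for_each_entry(folio, folio_list, lru) {
		/* Out of memory: stop, we are unlikely to make progress. */
		if (hugetlb_vmemmap_restore(h, &folio->page) == -ENOMEM)
			break;
	}
}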
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages
  2023-09-06  7:38   ` Muchun Song
@ 2023-09-06 21:38     ` Mike Kravetz
  2023-09-07  6:19       ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-06 21:38 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Naoya Horiguchi, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 09/06/23 15:38, Muchun Song wrote:
> 
> 
> On 2023/9/6 05:44, Mike Kravetz wrote:
> > Now that batching of hugetlb vmemmap optimization processing is possible,
> > batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
> > hugetlb page, we add them to a list that is freed after the entire batch
> > has been processed.
> > 
> > This enhances the ability to return contiguous ranges of memory to the
> > low level allocators.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >   mm/hugetlb_vmemmap.c | 60 ++++++++++++++++++++++++++++----------------
> >   1 file changed, 38 insertions(+), 22 deletions(-)
> > 
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 79de984919ef..a715712df831 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -306,18 +306,21 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >    * @end:	end address of the vmemmap virtual address range that we want to
> >    *		remap.
> >    * @reuse:	reuse address.
> > + * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
> > + *		responsibility to free pages.
> >    *
> >    * Return: %0 on success, negative error code otherwise.
> >    */
> >   static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > -			      unsigned long reuse)
> > +			      unsigned long reuse,
> > +			      struct list_head *vmemmap_pages)
> >   {
> >   	int ret;
> > -	LIST_HEAD(vmemmap_pages);
> > +	LIST_HEAD(freed_pages);
> 
> IIUC, we could reuse the parameter of @vmemmap_pages directly instead of
> a temporary variable, could it be dropped?
> 

I was concerned about the error case where we call vmemmap_remap_range a
second time.  In the first call to vmemmap_remap_range with vmemmap_remap_pte,
vmemmap pages to be freed are added to the end of the list (list_add_tail).
In the call to vmemmap_remap_range after error with vmemmap_restore_pte,
pages are taken off the head of the list (list_first_entry).  So, it seems
that it would be possible to use a different set of pages in the restore
operation.  This would be an issue if pages had different characteristics such
as being on different nodes.  Is that a real concern?

I suppose we could change vmemmap_remap_pte to add pages to the head of
the list?  I do not recall the reasoning behind adding to tail.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-06 21:12       ` Mike Kravetz
@ 2023-09-07  3:33         ` Muchun Song
  2023-09-07 18:54           ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-07  3:33 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton



> On Sep 7, 2023, at 05:12, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> On 09/06/23 16:07, Muchun Song wrote:
>>> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@linux.dev> wrote:
>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>> When removing hugetlb pages from the pool, we first create a list
>>>> of removed pages and then free those pages back to low level allocators.
>>>> Part of the 'freeing process' is to restore vmemmap for all base pages
>>>> if necessary.  Pass this list of pages to a new routine
>>>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
>>>> performed in bulk.
>>>> 
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> ---
>>>> mm/hugetlb.c         |  3 +++
>>>> mm/hugetlb_vmemmap.c | 13 +++++++++++++
>>>> mm/hugetlb_vmemmap.h |  5 +++++
>>>> 3 files changed, 21 insertions(+)
>>>> 
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 554be94b07bd..dd2dbc256172 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>>> {
>>>>  struct folio *folio, *t_folio;
>>>> + /* First restore vmemmap for all pages on list. */
>>>> + hugetlb_vmemmap_restore_folios(h, list);
>>>> +
>>>>  list_for_each_entry_safe(folio, t_folio, list, lru) {
>>>>  update_and_free_hugetlb_folio(h, folio, false);
>>>>  cond_resched();
>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>> index ac5577d372fe..79de984919ef 100644
>>>> --- a/mm/hugetlb_vmemmap.c
>>>> +++ b/mm/hugetlb_vmemmap.c
>>>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>>>  return ret;
>>>> }
>>>> +/*
>>>> + * This function will attempt to resore vmemmap for a list of folios.  There
>>>> + * is no guarantee that restoration will be successful for all or any folios.
>>>> + * This is used in bulk operations, and no feedback is given to the caller.
>>>> + */
>>>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
>>>> +{
>>>> + struct folio *folio;
>>>> +
>>>> + list_for_each_entry(folio, folio_list, lru)
>>>> + (void)hugetlb_vmemmap_restore(h, &folio->page);
>>> 
>>> I am curious about the purpose of "void" here, seems it it not necessnary,
>>> ritgh? We cound see so many palces where we do not add the void if the caller
>>> does not care about the return value of the callee.
>> 
>> Another question: should we stop restoring vmemmap pages when
>> hugetlb_vmemmap_restore() fails? In which case, I suspect there
>> is no memory probably, there is no need to continue, right?
> 
> Recall that the list of hugetlb pages may be from multiple nodes.  My first
> thought was that we should continue because memory allocation may fail on one
> node but succeed on another.  However, with
> https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@huawei.com/
> memory allocation should fall back to other nodes.  So, yes I do believe it
> would make sense to stop when hugetlb_vmemmap_restore returns ENOMEM as
> we are unlikely to make forward progress.

Agree.

> 
> Today's behavior will try to restore vmemmap for all pages.  No stopping
> on error.
> 
> I have mixed thoughts on this.  Quitting on error 'seems reasonable'.
> However, if we continue we 'might' be able to allocate vmemmap for one
> hugetlb page.  And, if we free one hugetlb page that should provide
> vmemmap for several more and we may be able to free most pages on the
> list.

Yes, a good point. But there would have to be a non-optimized huge page
being freed somewhere in parallel, otherwise we still cannot allocate memory.
However, the freeing operation happens after hugetlb_vmemmap_restore_folios().
If we want to handle this, we should rework update_and_free_pages_bulk()
to retry the restore once at least one huge page has been freed.
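
Something like this completely untested sketch is what I mean (it reuses
the HPageVmemmapOptimized flag to detect folios whose vmemmap could not be
restored):

static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
	bool retried = false;
	struct folio *folio, *t_folio;

retry:
	/* First restore vmemmap for all pages on list. */
	hugetlb_vmemmap_restore_folios(h, list);

	list_for_each_entry_safe(folio, t_folio, list, lru) {
		/*
		 * On the first pass, skip folios whose vmemmap could not
		 * be restored; freeing the others may release the memory
		 * needed to restore them on a second pass.
		 */
		if (!retried && HPageVmemmapOptimized(&folio->page))
			continue;

		list_del(&folio->lru);
		update_and_free_hugetlb_folio(h, folio, false);
		cond_resched();
	}

	if (!list_empty(list) && !retried) {
		retried = true;
		goto retry;
	}
}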

Thanks.

> -- 
> Mike Kravetz


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages
  2023-09-06 21:38     ` Mike Kravetz
@ 2023-09-07  6:19       ` Muchun Song
  2023-09-07 18:47         ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-07  6:19 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



> On Sep 7, 2023, at 05:38, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> On 09/06/23 15:38, Muchun Song wrote:
>> 
>> 
>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>> Now that batching of hugetlb vmemmap optimization processing is possible,
>>> batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
>>> hugetlb page, we add them to a list that is freed after the entire batch
>>> has been processed.
>>> 
>>> This enhances the ability to return contiguous ranges of memory to the
>>> low level allocators.
>>> 
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>  mm/hugetlb_vmemmap.c | 60 ++++++++++++++++++++++++++++----------------
>>>  1 file changed, 38 insertions(+), 22 deletions(-)
>>> 
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 79de984919ef..a715712df831 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -306,18 +306,21 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>   * @end: end address of the vmemmap virtual address range that we want to
>>>   * remap.
>>>   * @reuse: reuse address.
>>> + * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
>>> + * responsibility to free pages.
>>>   *
>>>   * Return: %0 on success, negative error code otherwise.
>>>   */
>>>  static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>> -       unsigned long reuse)
>>> +       unsigned long reuse,
>>> +       struct list_head *vmemmap_pages)
>>>  {
>>>   int ret;
>>> - LIST_HEAD(vmemmap_pages);
>>> + LIST_HEAD(freed_pages);
>> 
>> IIUC, we could reuse the parameter of @vmemmap_pages directly instead of
>> a temporary variable, could it be dropped?
>> 
> 
> I was concerned about the error case where we call vmemmap_remap_range a
> second time.  In the first call to vmemmap_remap_range with vmemmap_remap_pte,
> vmemmap pages to be freed are added to the end of the list (list_add_tail).
> In the call to vmemmap_remap_range after error with vmemmap_restore_pte,
> pages are taken off the head of the list (list_first_entry).  So, it seems
> that it would be possible to use a different set of pages in the restore

Yes.

> operation.  This would be an issue if pages had different characteristics such
> as being on different nodes.  Is that a real concern?

A good point. Now I see your concern; it is better to keep the same node
as before when an error occurs.

> 
> I suppose we could change vmemmap_remap_pte to add pages to the head of
> the list?  I do not recall the reasoning behind adding to tail.

I think we could do this; the code will be a little simpler. Actually,
there is no particular reason behind adding to the tail (BTW, the original
commit was introduced by me, no secret here :-)).

Thanks.

> -- 
> Mike Kravetz



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap
  2023-09-05 21:44 ` [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
@ 2023-09-07  6:55   ` Muchun Song
  2023-09-07 18:57     ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-07  6:55 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
	Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
	Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton



On 2023/9/6 05:44, Mike Kravetz wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
>
> Now that a list of pages is deduplicated at once, the TLB
> flush can be batched for all vmemmap pages that got remapped.
>
> Add a flags field and pass whether it's a bulk allocation or
> just a single page to decide to remap.
>
> The TLB flush is global as we don't have guarantees from caller
> that the set of folios is contiguous, or to add complexity in
> composing a list of kVAs to flush.
>
> Modified by Mike Kravetz to perform TLB flush on single folio if an
> error is encountered.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb_vmemmap.c | 38 ++++++++++++++++++++++++++++++--------
>   1 file changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index d956551699bc..8c85e2c38538 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -27,6 +27,7 @@
>    * @reuse_addr:		the virtual address of the @reuse_page page.
>    * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
>    *			or is mapped from.
> + * @flags:		used to modify behavior in bulk operations
>    */
>   struct vmemmap_remap_walk {
>   	void			(*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -35,6 +36,8 @@ struct vmemmap_remap_walk {
>   	struct page		*reuse_page;
>   	unsigned long		reuse_addr;
>   	struct list_head	*vmemmap_pages;
> +#define VMEMMAP_NO_TLB_FLUSH		BIT(0)
> +	unsigned long		flags;
>   };
>   
>   static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
> @@ -208,7 +211,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>   			return ret;
>   	} while (pgd++, addr = next, addr != end);
>   
> -	if (walk->remap_pte)
> +	if (walk->remap_pte && !(walk->flags & VMEMMAP_NO_TLB_FLUSH))
>   		flush_tlb_kernel_range(start, end);
>   
>   	return 0;
> @@ -348,12 +351,14 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
>    * @reuse:	reuse address.
>    * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
>    *		responsibility to free pages.
> + * @flags:	modifications to vmemmap_remap_walk flags
>    *
>    * Return: %0 on success, negative error code otherwise.
>    */
>   static int vmemmap_remap_free(unsigned long start, unsigned long end,
>   			      unsigned long reuse,
> -			      struct list_head *vmemmap_pages)
> +			      struct list_head *vmemmap_pages,
> +			      unsigned long flags)
>   {
>   	int ret;
>   	LIST_HEAD(freed_pages);
> @@ -361,6 +366,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>   		.remap_pte	= vmemmap_remap_pte,
>   		.reuse_addr	= reuse,
>   		.vmemmap_pages	= &freed_pages,
> +		.flags		= flags,
>   	};
>   	int nid = page_to_nid((struct page *)start);
>   	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> @@ -410,6 +416,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>   			.remap_pte	= vmemmap_restore_pte,
>   			.reuse_addr	= reuse,
>   			.vmemmap_pages	= &freed_pages,
> +			.flags		= 0,
>   		};
>   
>   		vmemmap_remap_range(reuse, end, &walk);
> @@ -597,7 +604,8 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
>   
>   static void __hugetlb_vmemmap_optimize(const struct hstate *h,
>   					struct page *head,
> -					struct list_head *vmemmap_pages)
> +					struct list_head *vmemmap_pages,
> +					unsigned long flags)
>   {
>   	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>   	unsigned long vmemmap_reuse;
> @@ -607,6 +615,18 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
>   		return;
>   
>   	static_branch_inc(&hugetlb_optimize_vmemmap_key);
> +	/*
> +	 * Very Subtle
> +	 * If VMEMMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
> +	 * immediately after remapping.  As a result, subsequent accesses
> +	 * and modifications to struct pages associated with the hugetlb
> +	 * page could bet to the OLD struct pages.  Set the vmemmap optimized
> +	 * flag here so that it is copied to the new head page.  This keeps
> +	 * the old and new struct pages in sync.
> +	 * If there is an error during optimization, we will immediately FLUSH
> +	 * the TLB and clear the flag below.
> +	 */
> +	SetHPageVmemmapOptimized(head);
>   
>   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
>   	vmemmap_reuse	= vmemmap_start;
> @@ -618,10 +638,10 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
>   	 * mapping the range to vmemmap_pages list so that they can be freed by
>   	 * the caller.
>   	 */
> -	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages))
> +	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages, flags)) {
>   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> -	else
> -		SetHPageVmemmapOptimized(head);
> +		ClearHPageVmemmapOptimized(head);
> +	}
>   }
>   
>   /**
> @@ -638,7 +658,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>   {
>   	LIST_HEAD(vmemmap_pages);
>   
> -	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
> +	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages, 0UL);

The UL suffix could be dropped, right?

>   	free_vmemmap_page_list(&vmemmap_pages);
>   }
>   
> @@ -672,7 +692,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
>   	flush_tlb_all();
>   
>   	list_for_each_entry(folio, folio_list, lru)
> -		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
> +		__hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages, VMEMMAP_NO_TLB_FLUSH);
> +
> +	flush_tlb_all();
>   
>   	free_vmemmap_page_list(&vmemmap_pages);
>   }


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap
  2023-09-05 21:44 ` [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
@ 2023-09-07  6:58   ` Muchun Song
  2023-09-07 18:58     ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2023-09-07  6:58 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, Joao Martins
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
	David Rientjes, Anshuman Khandual, Naoya Horiguchi, Michal Hocko,
	Matthew Wilcox, Xiongchun Duan, Andrew Morton



On 2023/9/6 05:44, Mike Kravetz wrote:
> Update the hugetlb_vmemmap_restore path to take a 'batch' parameter that

s/batch/flags/g

And the commit log should be reworked since the parameter has been changed.

> indicates restoration is happening on a batch of pages.  When set, use
> the existing mechanism (VMEMMAP_NO_TLB_FLUSH) to delay TLB flushing.
> The routine hugetlb_vmemmap_restore_folios is the only user of this new
> batch parameter and it will perform a global flush after all vmemmap is
> restored.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++--------------
>   1 file changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 8c85e2c38538..11fda9d061eb 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -458,17 +458,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
>    * @end:	end address of the vmemmap virtual address range that we want to
>    *		remap.
>    * @reuse:	reuse address.
> + * @flags:	modify behavior for bulk operations
>    *
>    * Return: %0 on success, negative error code otherwise.
>    */
>   static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> -			       unsigned long reuse)
> +			       unsigned long reuse, unsigned long flags)
>   {
>   	LIST_HEAD(vmemmap_pages);
>   	struct vmemmap_remap_walk walk = {
>   		.remap_pte	= vmemmap_restore_pte,
>   		.reuse_addr	= reuse,
>   		.vmemmap_pages	= &vmemmap_pages,
> +		.flags		= flags,
>   	};
>   
>   	/* See the comment in the vmemmap_remap_free(). */
> @@ -490,17 +492,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
>   static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
>   core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
>   
> -/**
> - * hugetlb_vmemmap_restore - restore previously optimized (by
> - *			     hugetlb_vmemmap_optimize()) vmemmap pages which
> - *			     will be reallocated and remapped.
> - * @h:		struct hstate.
> - * @head:	the head page whose vmemmap pages will be restored.
> - *
> - * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> - * negative error code otherwise.
> - */
> -int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> +static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
>   {
>   	int ret;
>   	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> @@ -521,7 +513,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>   	 * When a HugeTLB page is freed to the buddy allocator, previously
>   	 * discarded vmemmap pages must be allocated and remapping.
>   	 */
> -	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
> +	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
>   	if (!ret) {
>   		ClearHPageVmemmapOptimized(head);
>   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> @@ -530,6 +522,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>   	return ret;
>   }
>   
> +/**
> + * hugetlb_vmemmap_restore - restore previously optimized (by
> + *			     hugetlb_vmemmap_optimize()) vmemmap pages which
> + *			     will be reallocated and remapped.
> + * @h:		struct hstate.
> + * @head:	the head page whose vmemmap pages will be restored.
> + *
> + * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> + * negative error code otherwise.
> + */
> +int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> +{
> +	return __hugetlb_vmemmap_restore(h, head, 0UL);

The UL suffix could be dropped.

Thanks.

> +}
> +
>   /*
>    * This function will attempt to resore vmemmap for a list of folios.  There
>    * is no guarantee that restoration will be successful for all or any folios.
> @@ -540,7 +547,9 @@ void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *fo
>   	struct folio *folio;
>   
>   	list_for_each_entry(folio, folio_list, lru)
> -		(void)hugetlb_vmemmap_restore(h, &folio->page);
> +		(void)__hugetlb_vmemmap_restore(h, &folio->page, VMEMMAP_NO_TLB_FLUSH);
> +
> +	flush_tlb_all();
>   }
>   
>   /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages
  2023-09-07  6:19       ` Muchun Song
@ 2023-09-07 18:47         ` Mike Kravetz
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-07 18:47 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
	Andrew Morton

On 09/07/23 14:19, Muchun Song wrote:
> 
> 
> > On Sep 7, 2023, at 05:38, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> > On 09/06/23 15:38, Muchun Song wrote:
> >> 
> >> 
> >> On 2023/9/6 05:44, Mike Kravetz wrote:
> >>> Now that batching of hugetlb vmemmap optimization processing is possible,
> >>> batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
> >>> hugetlb page, we add them to a list that is freed after the entire batch
> >>> has been processed.
> >>> 
> >>> This enhances the ability to return contiguous ranges of memory to the
> >>> low level allocators.
> >>> 
> >>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>> ---
> >>>  mm/hugetlb_vmemmap.c | 60 ++++++++++++++++++++++++++++----------------
> >>>  1 file changed, 38 insertions(+), 22 deletions(-)
> >>> 
> >>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >>> index 79de984919ef..a715712df831 100644
> >>> --- a/mm/hugetlb_vmemmap.c
> >>> +++ b/mm/hugetlb_vmemmap.c
> >>> @@ -306,18 +306,21 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >>>   * @end: end address of the vmemmap virtual address range that we want to
> >>>   * remap.
> >>>   * @reuse: reuse address.
> >>> + * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
> >>> + * responsibility to free pages.
> >>>   *
> >>>   * Return: %0 on success, negative error code otherwise.
> >>>   */
> >>>  static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>> -       unsigned long reuse)
> >>> +       unsigned long reuse,
> >>> +       struct list_head *vmemmap_pages)
> >>>  {
> >>>   int ret;
> >>> - LIST_HEAD(vmemmap_pages);
> >>> + LIST_HEAD(freed_pages);
> >> 
> >> IIUC, we could reuse the parameter of @vmemmap_pages directly instead of
> >> a temporary variable, could it be dropped?
> >> 
> > 
> > I was concerned about the error case where we call vmemmap_remap_range a
> > second time.  In the first call to vmemmap_remap_range with vmemmap_remap_pte,
> > vmemmap pages to be freed are added to the end of the list (list_add_tail).
> > In the call to vmemmap_remap_range after error with vmemmap_restore_pte,
> > pages are taken off the head of the list (list_first_entry).  So, it seems
> > that it would be possible to use a different set of pages in the restore
> 
> Yes.
> 
> > operation.  This would be an issue if pages had different characteristics such
> > as being on different nodes.  Is that a real concern?
> 
> A good point. Now I see your concern, it is better to keep the same node
> as before when error occurs.
> 
> > 
> > I suppose we could change vmemmap_remap_pte to add pages to the head of
> > the list?  I do not recall the reasoning behind adding to tail.
> 
> I think we could do this, the code will be a little simple. Actually, there
> is no reason behind adding to tail (BTW, the first commit is introduced by
> me, no secret here :-)).

Ok, I will change the way pages are added and removed from the list so
that in case of error we get the same pages.  Then I can remove the
local list.
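
That is, essentially just flipping the producer side to add at the head,
so the error path pops back exactly the pages that were just unmapped.
Hand written to illustrate, not a real diff against the series:

 	/* in vmemmap_remap_pte() */
-	list_add_tail(&page->lru, walk->vmemmap_pages);
+	list_add(&page->lru, walk->vmemmap_pages);

With that, vmemmap_remap_free() should be able to operate directly on the
caller supplied list instead of the local freed_pages list.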
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-07  3:33         ` Muchun Song
@ 2023-09-07 18:54           ` Mike Kravetz
  2023-09-08 20:53             ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-07 18:54 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 09/07/23 11:33, Muchun Song wrote:
> 
> 
> > On Sep 7, 2023, at 05:12, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> > On 09/06/23 16:07, Muchun Song wrote:
> >>> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@linux.dev> wrote:
> >>> On 2023/9/6 05:44, Mike Kravetz wrote:
> >>>> When removing hugetlb pages from the pool, we first create a list
> >>>> of removed pages and then free those pages back to low level allocators.
> >>>> Part of the 'freeing process' is to restore vmemmap for all base pages
> >>>> if necessary.  Pass this list of pages to a new routine
> >>>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> >>>> performed in bulk.
> >>>> 
> >>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>> ---
> >>>> mm/hugetlb.c         |  3 +++
> >>>> mm/hugetlb_vmemmap.c | 13 +++++++++++++
> >>>> mm/hugetlb_vmemmap.h |  5 +++++
> >>>> 3 files changed, 21 insertions(+)
> >>>> 
> >>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>>> index 554be94b07bd..dd2dbc256172 100644
> >>>> --- a/mm/hugetlb.c
> >>>> +++ b/mm/hugetlb.c
> >>>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> >>>> {
> >>>>  struct folio *folio, *t_folio;
> >>>> + /* First restore vmemmap for all pages on list. */
> >>>> + hugetlb_vmemmap_restore_folios(h, list);
> >>>> +
> >>>>  list_for_each_entry_safe(folio, t_folio, list, lru) {
> >>>>  update_and_free_hugetlb_folio(h, folio, false);
> >>>>  cond_resched();
> >>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >>>> index ac5577d372fe..79de984919ef 100644
> >>>> --- a/mm/hugetlb_vmemmap.c
> >>>> +++ b/mm/hugetlb_vmemmap.c
> >>>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> >>>>  return ret;
> >>>> }
> >>>> +/*
> >>>> + * This function will attempt to restore vmemmap for a list of folios.  There
> >>>> + * is no guarantee that restoration will be successful for all or any folios.
> >>>> + * This is used in bulk operations, and no feedback is given to the caller.
> >>>> + */
> >>>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> >>>> +{
> >>>> + struct folio *folio;
> >>>> +
> >>>> + list_for_each_entry(folio, folio_list, lru)
> >>>> + (void)hugetlb_vmemmap_restore(h, &folio->page);
> >>> 
> >>> I am curious about the purpose of "void" here; it seems it is not necessary,
> >>> right? There are so many places where we do not add the void when the caller
> >>> does not care about the return value of the callee.
> >> 
> >> Another question: should we stop restoring vmemmap pages when
> >> hugetlb_vmemmap_restore() fails? In that case, there is probably no
> >> memory left, so there is no need to continue, right?
> > 
> > Recall that the list of hugetlb pages may be from multiple nodes.  My first
> > thought was that we should continue because memory allocation may fail on one
> > node but succeed on another.  However, with
> > https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@huawei.com/
> > memory allocation should fall back to other nodes.  So, yes I do believe it
> > would make sense to stop when hugetlb_vmemmap_restore returns ENOMEM as
> > we are unlikely to make forward progress.
> 
> Agree.
> 
> > 
> > Today's behavior will try to restore vmemmap for all pages.  No stopping
> > on error.
> > 
> > I have mixed thoughts on this.  Quitting on error 'seems reasonable'.
> > However, if we continue we 'might' be able to allocate vmemmap for one
> > hugetlb page.  And, if we free one hugetlb page that should provide
> > vmemmap for several more and we may be able to free most pages on the
> > list.
> 
> Yes. A good point. But there would have to be a non-optimized huge page being
> freed somewhere in parallel; otherwise we still cannot allocate memory.

It does not have to be another huge page being freed in parallel.  It
could be that when allocating vmemmap for a 1G hugetlb page we were one
(4K) page short of what was required.  If someone else frees a 4K page,
freeing the next 1G page may succeed.
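
To put rough numbers on it (assuming x86-64 with 4K base pages and a 64
byte struct page): a 1G hugetlb page has 262144 struct pages, i.e. 16 MiB
of vmemmap spread over 4096 base pages, and optimization freed all but the
reserved page.  Restoring it therefore needs roughly 4095 page allocations,
and the whole restore fails even if we end up just one page short.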
-- 
Mike Kravetz

> However, the freeing operation happens after hugetlb_vmemmap_restore_folios.
> If we want to handle this, we should rework update_and_free_pages_bulk()
> to retry the restore once at least one huge page has been freed.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap
  2023-09-07  6:55   ` Muchun Song
@ 2023-09-07 18:57     ` Mike Kravetz
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-07 18:57 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Naoya Horiguchi, Barry Song, Michal Hocko,
	Matthew Wilcox, Xiongchun Duan, Andrew Morton

On 09/07/23 14:55, Muchun Song wrote:
> 
> 
> On 2023/9/6 05:44, Mike Kravetz wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> > 
> > Now that a list of pages is deduplicated at once, the TLB
> > flush can be batched for all vmemmap pages that got remapped.
> > 
> > Add a flags field and pass it down to indicate whether this is a bulk
> > operation or just a single page, so the remap code knows whether to
> > defer the TLB flush.
> > 
> > The TLB flush is global because the caller gives no guarantee that the
> > set of folios is contiguous, and we want to avoid the complexity of
> > composing a list of kVAs to flush.
> > 
> > Modified by Mike Kravetz to perform TLB flush on single folio if an
> > error is encountered.
> > 
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >   mm/hugetlb_vmemmap.c | 38 ++++++++++++++++++++++++++++++--------
> >   1 file changed, 30 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c

> > @@ -638,7 +658,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >   {
> >   	LIST_HEAD(vmemmap_pages);
> > -	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
> > +	__hugetlb_vmemmap_optimize(h, head, &vmemmap_pages, 0UL);
> 
> UL suffix could be dropped. Right?

Yes, it can be dropped.

> 
> >   	free_vmemmap_page_list(&vmemmap_pages);
> >   }
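
For context, the bulk path this flag feeds looks roughly like the following
(untested sketch; hugetlb_vmemmap_optimize_folios() is added earlier in this
series, and the per-folio checks are omitted):

	void hugetlb_vmemmap_optimize_folios(const struct hstate *h,
					     struct list_head *folio_list)
	{
		struct folio *folio;
		LIST_HEAD(vmemmap_pages);

		list_for_each_entry(folio, folio_list, lru)
			__hugetlb_vmemmap_optimize(h, &folio->page,
						   &vmemmap_pages,
						   VMEMMAP_NO_TLB_FLUSH);

		/* One global flush for the whole batch ... */
		flush_tlb_all();

		/* ... then free every remapped vmemmap page in one go. */
		free_vmemmap_page_list(&vmemmap_pages);
	}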

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap
  2023-09-07  6:58   ` Muchun Song
@ 2023-09-07 18:58     ` Mike Kravetz
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-09-07 18:58 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, Joao Martins, Muchun Song,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Naoya Horiguchi, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 09/07/23 14:58, Muchun Song wrote:
> 
> 
> On 2023/9/6 05:44, Mike Kravetz wrote:
> > Update the hugetlb_vmemmap_restore path to take a 'batch' parameter that
> 
> s/batch/flags/g
> 
> And it should be reworked since the parameter has been changed.

Yes.

> 
> > indicates restoration is happening on a batch of pages.  When set, use
> > the existing mechanism (VMEMMAP_NO_TLB_FLUSH) to delay TLB flushing.
> > The routine hugetlb_vmemmap_restore_folios is the only user of this new
> > batch parameter and it will perform a global flush after all vmemmap is
> > restored.
> > 
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >   mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++--------------
> >   1 file changed, 23 insertions(+), 14 deletions(-)
> > 
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 8c85e2c38538..11fda9d061eb 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -458,17 +458,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> >    * @end:	end address of the vmemmap virtual address range that we want to
> >    *		remap.
> >    * @reuse:	reuse address.
> > + * @flags:	modify behavior for bulk operations
> >    *
> >    * Return: %0 on success, negative error code otherwise.
> >    */
> >   static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > -			       unsigned long reuse)
> > +			       unsigned long reuse, unsigned long flags)
> >   {
> >   	LIST_HEAD(vmemmap_pages);
> >   	struct vmemmap_remap_walk walk = {
> >   		.remap_pte	= vmemmap_restore_pte,
> >   		.reuse_addr	= reuse,
> >   		.vmemmap_pages	= &vmemmap_pages,
> > +		.flags		= flags,
> >   	};
> >   	/* See the comment in the vmemmap_remap_free(). */
> > @@ -490,17 +492,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
> >   static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
> >   core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
> > -/**
> > - * hugetlb_vmemmap_restore - restore previously optimized (by
> > - *			     hugetlb_vmemmap_optimize()) vmemmap pages which
> > - *			     will be reallocated and remapped.
> > - * @h:		struct hstate.
> > - * @head:	the head page whose vmemmap pages will be restored.
> > - *
> > - * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> > - * negative error code otherwise.
> > - */
> > -int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> > +static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
> >   {
> >   	int ret;
> >   	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> > @@ -521,7 +513,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> >   	 * When a HugeTLB page is freed to the buddy allocator, previously
> >   	 * discarded vmemmap pages must be allocated and remapping.
> >   	 */
> > -	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
> > +	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
> >   	if (!ret) {
> >   		ClearHPageVmemmapOptimized(head);
> >   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> > @@ -530,6 +522,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> >   	return ret;
> >   }
> > +/**
> > + * hugetlb_vmemmap_restore - restore previously optimized (by
> > + *			     hugetlb_vmemmap_optimize()) vmemmap pages which
> > + *			     will be reallocated and remapped.
> > + * @h:		struct hstate.
> > + * @head:	the head page whose vmemmap pages will be restored.
> > + *
> > + * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> > + * negative error code otherwise.
> > + */
> > +int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> > +{
> > +	return __hugetlb_vmemmap_restore(h, head, 0UL);
> 
> UL suffix could be dropped.

Thanks, will fix both in next version.
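
The restore side then mirrors the free side: with the flag set, each call
skips its own flush and the list-level routine does one global flush at the
end.  Untested sketch of what hugetlb_vmemmap_restore_folios() reduces to:

	void hugetlb_vmemmap_restore_folios(const struct hstate *h,
					    struct list_head *folio_list)
	{
		struct folio *folio;

		list_for_each_entry(folio, folio_list, lru)
			__hugetlb_vmemmap_restore(h, &folio->page,
						  VMEMMAP_NO_TLB_FLUSH);

		/* Single global flush after all vmemmap has been restored. */
		flush_tlb_all();
	}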
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-07 18:54           ` Mike Kravetz
@ 2023-09-08 20:53             ` Mike Kravetz
  2023-09-11  3:10               ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-09-08 20:53 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 09/07/23 11:54, Mike Kravetz wrote:
> On 09/07/23 11:33, Muchun Song wrote:
> > 
> > 
> > > On Sep 7, 2023, at 05:12, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > 
> > > On 09/06/23 16:07, Muchun Song wrote:
> > >>> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@linux.dev> wrote:
> > >>> On 2023/9/6 05:44, Mike Kravetz wrote:
> > >>>> When removing hugetlb pages from the pool, we first create a list
> > >>>> of removed pages and then free those pages back to low level allocators.
> > >>>> Part of the 'freeing process' is to restore vmemmap for all base pages
> > >>>> if necessary.  Pass this list of pages to a new routine
> > >>>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> > >>>> performed in bulk.
> > >>>> 
> > >>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > >>>> ---
> > >>>> mm/hugetlb.c         |  3 +++
> > >>>> mm/hugetlb_vmemmap.c | 13 +++++++++++++
> > >>>> mm/hugetlb_vmemmap.h |  5 +++++
> > >>>> 3 files changed, 21 insertions(+)
> > >>>> 
> > >>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > >>>> index 554be94b07bd..dd2dbc256172 100644
> > >>>> --- a/mm/hugetlb.c
> > >>>> +++ b/mm/hugetlb.c
> > >>>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> > >>>> {
> > >>>>  struct folio *folio, *t_folio;
> > >>>> + /* First restore vmemmap for all pages on list. */
> > >>>> + hugetlb_vmemmap_restore_folios(h, list);
> > >>>> +
> > >>>>  list_for_each_entry_safe(folio, t_folio, list, lru) {
> > >>>>  update_and_free_hugetlb_folio(h, folio, false);
> > >>>>  cond_resched();
> > >>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > >>>> index ac5577d372fe..79de984919ef 100644
> > >>>> --- a/mm/hugetlb_vmemmap.c
> > >>>> +++ b/mm/hugetlb_vmemmap.c
> > >>>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> > >>>>  return ret;
> > >>>> }
> > >>>> +/*
> > >>>> + * This function will attempt to restore vmemmap for a list of folios.  There
> > >>>> + * is no guarantee that restoration will be successful for all or any folios.
> > >>>> + * This is used in bulk operations, and no feedback is given to the caller.
> > >>>> + */
> > >>>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> > >>>> +{
> > >>>> + struct folio *folio;
> > >>>> +
> > >>>> + list_for_each_entry(folio, folio_list, lru)
> > >>>> + (void)hugetlb_vmemmap_restore(h, &folio->page);
> > >>> 
> > >>> I am curious about the purpose of "void" here; it seems it is not necessary,
> > >>> right? There are so many places where we do not add the void when the caller
> > >>> does not care about the return value of the callee.
> > >> 
> > >> Another question: should we stop restoring vmemmap pages when
> > >> hugetlb_vmemmap_restore() fails? In that case, there is probably no
> > >> memory left, so there is no need to continue, right?
> > > 
> > > Recall that the list of hugetlb pages may be from multiple nodes.  My first
> > > thought was that we should continue because memory allocation may fail on one
> > > node but succeed on another.  However, with
> > > https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@huawei.com/
> > > memory allocation should fall back to other nodes.  So, yes I do believe it
> > > would make sense to stop when hugetlb_vmemmap_restore returns ENOMEM as
> > > we are unlikely to make forward progress.
> > 
> > Agree.
> > 
> > > 
> > > Today's behavior will try to restore vmemmap for all pages.  No stopping
> > > on error.
> > > 
> > > I have mixed thoughts on this.  Quitting on error 'seems reasonable'.
> > > However, if we continue we 'might' be able to allocate vmemmap for one
> > > hugetlb page.  And, if we free one hugetlb page that should provide
> > > vmemmap for several more and we may be able to free most pages on the
> > > list.
> > 
> > Yes. A good point. But there would have to be a non-optimized huge page being
> > freed somewhere in parallel; otherwise we still cannot allocate memory.
> 
> It does not have to be another huge page being freed in parallel.  It
> could be that when allocating vmemmap for a 1G hugetlb page we were one
> (4K) page short of what was required.  If someone else frees a 4K page,
> freeing the next 1G page may succeed.
> -- 
> Mike Kravetz
> 
> > However, the freeing operation happens after hugetlb_vmemmap_restore_folios.
> > If we want to handle this, we should rework update_and_free_pages_bulk()
> > to retry the restore once at least one huge page has been freed.

This seemed familiar.  Recall this patch which Muchun Reviewed and James Acked,
https://lore.kernel.org/linux-mm/20230718004942.113174-3-mike.kravetz@oracle.com/

If we can not restore vmemmap for a page, then it must be turned into a
surplus huge page.  In this patch (not the previous one referenced), we
will try to restore vmemmap one more time in a later call to
update_and_free_hugetlb_folio.  Certainly, we do not want to try twice!

My 'plan' is to include the previous patch as part of this series.  With
that patch in place, the list_for_each_entry calls to hugetlb_vmemmap_restore
can be replaced with a call to hugetlb_vmemmap_restore_folios.  We would
change the behavior of hugetlb_vmemmap_restore_folios to return an error
instead of being of type void.  If an error is returned, then we will
make another pass through the list looking for unoptimized pages and add
them as surplus.

I think it best if we try to restore vmemmap at least once before
converting to a surplus page.
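
Roughly, the caller side would then look like this (untested sketch; the
surplus conversion step is abbreviated and add_as_surplus() is a
hypothetical helper standing in for the logic in the patch linked above):

	static void update_and_free_pages_bulk(struct hstate *h,
					       struct list_head *list)
	{
		struct folio *folio, *t_folio;

		/*
		 * Try to restore vmemmap for everything on the list first.
		 * A nonzero return means at least one folio is still
		 * optimized (e.g. we hit ENOMEM part way through).
		 */
		if (hugetlb_vmemmap_restore_folios(h, list)) {
			list_for_each_entry_safe(folio, t_folio, list, lru) {
				if (HPageVmemmapOptimized(&folio->page))
					/* hypothetical: put the folio back in
					 * the pool as a surplus page and drop
					 * it from the list */
					add_as_surplus(h, folio);
			}
		}

		list_for_each_entry_safe(folio, t_folio, list, lru) {
			update_and_free_hugetlb_folio(h, folio, false);
			cond_resched();
		}
	}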
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 07/11] hugetlb: perform vmemmap restoration on a list of pages
  2023-09-08 20:53             ` Mike Kravetz
@ 2023-09-11  3:10               ` Muchun Song
  0 siblings, 0 replies; 44+ messages in thread
From: Muchun Song @ 2023-09-11  3:10 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
	David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
	Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton



> On Sep 9, 2023, at 04:53, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> On 09/07/23 11:54, Mike Kravetz wrote:
>> On 09/07/23 11:33, Muchun Song wrote:
>>> 
>>> 
>>>> On Sep 7, 2023, at 05:12, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>> 
>>>> On 09/06/23 16:07, Muchun Song wrote:
>>>>>> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@linux.dev> wrote:
>>>>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>>>>> When removing hugetlb pages from the pool, we first create a list
>>>>>>> of removed pages and then free those pages back to low level allocators.
>>>>>>> Part of the 'freeing process' is to restore vmemmap for all base pages
>>>>>>> if necessary.  Pass this list of pages to a new routine
>>>>>>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
>>>>>>> performed in bulk.
>>>>>>> 
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>> ---
>>>>>>> mm/hugetlb.c         |  3 +++
>>>>>>> mm/hugetlb_vmemmap.c | 13 +++++++++++++
>>>>>>> mm/hugetlb_vmemmap.h |  5 +++++
>>>>>>> 3 files changed, 21 insertions(+)
>>>>>>> 
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index 554be94b07bd..dd2dbc256172 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>>>>>> {
>>>>>>> struct folio *folio, *t_folio;
>>>>>>> + /* First restore vmemmap for all pages on list. */
>>>>>>> + hugetlb_vmemmap_restore_folios(h, list);
>>>>>>> +
>>>>>>> list_for_each_entry_safe(folio, t_folio, list, lru) {
>>>>>>> update_and_free_hugetlb_folio(h, folio, false);
>>>>>>> cond_resched();
>>>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>>>> index ac5577d372fe..79de984919ef 100644
>>>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>>>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>>>>>> return ret;
>>>>>>> }
>>>>>>> +/*
>>>>>>> + * This function will attempt to restore vmemmap for a list of folios.  There
>>>>>>> + * is no guarantee that restoration will be successful for all or any folios.
>>>>>>> + * This is used in bulk operations, and no feedback is given to the caller.
>>>>>>> + */
>>>>>>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
>>>>>>> +{
>>>>>>> + struct folio *folio;
>>>>>>> +
>>>>>>> + list_for_each_entry(folio, folio_list, lru)
>>>>>>> + (void)hugetlb_vmemmap_restore(h, &folio->page);
>>>>>> 
>>>>>> I am curious about the purpose of "void" here; it seems it is not necessary,
>>>>>> right? There are so many places where we do not add the void when the caller
>>>>>> does not care about the return value of the callee.
>>>>> 
>>>>> Another question: should we stop restoring vmemmap pages when
>>>>> hugetlb_vmemmap_restore() fails? In that case, there is probably no
>>>>> memory left, so there is no need to continue, right?
>>>> 
>>>> Recall that the list of hugetlb pages may be from multiple nodes.  My first
>>>> thought was that we should continue because memory allocation may fail on one
>>>> node but succeed on another.  However, with
>>>> https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@huawei.com/
>>>> memory allocation should fall back to other nodes.  So, yes I do believe it
>>>> would make sense to stop when hugetlb_vmemmap_restore returns ENOMEM as
>>>> we are unlikely to make forward progress.
>>> 
>>> Agree.
>>> 
>>>> 
>>>> Today's behavior will try to restore vmemmap for all pages.  No stopping
>>>> on error.
>>>> 
>>>> I have mixed thoughts on this.  Quitting on error 'seems reasonable'.
>>>> However, if we continue we 'might' be able to allocate vmemmap for one
>>>> hugetlb page.  And, if we free one hugetlb page that should provide
>>>> vmemmap for several more and we may be able to free most pages on the
>>>> list.
>>> 
>>> Yes. A good point. But there would have to be a non-optimized huge page being
>>> freed somewhere in parallel; otherwise we still cannot allocate memory.
>> 
>> It does not have to be another huge page being freed in parallel.  It
>> could be that when allocating vmemmap for a 1G hugetlb page we were one
>> (4K) page short of what was required.  If someone else frees a 4K page,
>> freeing the next 1G page may succeed.

Right. I missed that.

>> -- 
>> Mike Kravetz
>> 
>>> However, the freeing operation happens after hugetlb_vmemmap_restore_folios.
>>> If we want to handle this, we should rework update_and_free_pages_bulk()
>>> to retry the restore once at least one huge page has been freed.
> 
> This seemed familiar.  Recall this patch which Muchun Reviewed and James Acked,
> https://lore.kernel.org/linux-mm/20230718004942.113174-3-mike.kravetz@oracle.com/
> 
> If we can not restore vmemmap for a page, then it must be turned into a
> surplus huge page.  In this patch (not the previous one referenced), we
> will try to restore vmemmap one more time in a later call to
> update_and_free_hugetlb_folio.  Certainly, we do not want to try twice!
> 
> My 'plan' is to include the previous patch as part of this series.  With
> that patch in place, the list_for_each_entry calls to hugetlb_vmemmap_restore
> can be replaced with a call to hugetlb_vmemmap_restore_folios.  We would
> change the behavior of hugetlb_vmemmap_restore_folios to return an error
> instead of being of type void.  If an error is returned, then we will
> make another pass through the list looking for unoptimized pages and add
> them as surplus.
> 
> I think it best if we try to restore vmemmap at least once before
> converting to a surplus page.

Make sense.

Muchun

> -- 
> Mike Kravetz


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-09-05 21:44 ` [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap Mike Kravetz
  2023-09-06  0:48   ` Matthew Wilcox
@ 2023-10-13 12:58   ` Naoya Horiguchi
  2023-10-13 21:43     ` Mike Kravetz
  2023-10-17  3:21     ` Mike Kravetz
  1 sibling, 2 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2023-10-13 12:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> Currently, vmemmap optimization of hugetlb pages is performed before the
> hugetlb flag (previously hugetlb destructor) is set identifying it as a
> hugetlb folio.  This means there is a window of time where an ordinary
> folio does not have all associated vmemmap present.  The core mm only
> expects vmemmap to be potentially optimized for hugetlb  and device dax.
> This can cause problems in code such as memory error handling that may
> want to write to tail struct pages.
> 
> There is only one call to perform hugetlb vmemmap optimization today.
> To fix this issue, simply set the hugetlb flag before that call.
> 
> There was a similar issue in the free hugetlb path that was previously
> addressed.  The two routines that optimize or restore hugetlb vmemmap
> should only be passed hugetlb folios/pages.  To catch any callers not
> following this rule, add VM_WARN_ON calls to the routines.  In the
> hugetlb free code paths, some calls could be made to restore vmemmap
> after clearing the hugetlb flag.  This was 'safe' as in these cases
> vmemmap was already present and the call was a NOOP.  However, for
> consistency these calls were eliminated so that we can add the
> VM_WARN_ON checks.
> 
> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

I saw that VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
memory_failure() is called on a free hugetlb page with vmemmap optimization
disabled (the warning is not triggered if vmemmap optimization is enabled).
> I think that we need to check folio_test_hugetlb() before dissolve_free_huge_page()
calls hugetlb_vmemmap_restore_folio().

Could you consider adding some diff like below?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2312,15 +2312,16 @@ int dissolve_free_huge_page(struct page *page)
 		 * Attempt to allocate vmemmmap here so that we can take
 		 * appropriate action on failure.
 		 */
-		rc = hugetlb_vmemmap_restore_folio(h, folio);
-		if (!rc) {
-			update_and_free_hugetlb_folio(h, folio, false);
-		} else {
-			spin_lock_irq(&hugetlb_lock);
-			add_hugetlb_folio(h, folio, false);
-			h->max_huge_pages++;
-			spin_unlock_irq(&hugetlb_lock);
+		if (folio_test_hugetlb(folio)) {
+			rc = hugetlb_vmemmap_restore_folio(h, folio);
+			if (rc) {
+				spin_lock_irq(&hugetlb_lock);
+				add_hugetlb_folio(h, folio, false);
+				h->max_huge_pages++;
+				goto out;
+			}
 		}
+		update_and_free_hugetlb_folio(h, folio, false);
 
 		return rc;
 	}


Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-10-13 12:58   ` Naoya Horiguchi
@ 2023-10-13 21:43     ` Mike Kravetz
  2023-10-16 22:55       ` Andrew Morton
  2023-10-17  3:21     ` Mike Kravetz
  1 sibling, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-10-13 21:43 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Michal Hocko, Matthew Wilcox, Xiongchun Duan

On 10/13/23 21:58, Naoya Horiguchi wrote:
> On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> > Currently, vmemmap optimization of hugetlb pages is performed before the
> > hugetlb flag (previously hugetlb destructor) is set identifying it as a
> > hugetlb folio.  This means there is a window of time where an ordinary
> > folio does not have all associated vmemmap present.  The core mm only
> > expects vmemmap to be potentially optimized for hugetlb  and device dax.
> > This can cause problems in code such as memory error handling that may
> > want to write to tail struct pages.
> > 
> > There is only one call to perform hugetlb vmemmap optimization today.
> > To fix this issue, simply set the hugetlb flag before that call.
> > 
> > There was a similar issue in the free hugetlb path that was previously
> > addressed.  The two routines that optimize or restore hugetlb vmemmap
> > should only be passed hugetlb folios/pages.  To catch any callers not
> > following this rule, add VM_WARN_ON calls to the routines.  In the
> > hugetlb free code paths, some calls could be made to restore vmemmap
> > after clearing the hugetlb flag.  This was 'safe' as in these cases
> > vmemmap was already present and the call was a NOOP.  However, for
> > consistency these calls were eliminated so that we can add the
> > VM_WARN_ON checks.
> > 
> > Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> I saw that VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
> memory_failure() is called on a free hugetlb page with vmemmap optimization
> disabled (the warning is not triggered if vmemmap optimization is enabled).
> > I think that we need to check folio_test_hugetlb() before dissolve_free_huge_page()
> calls hugetlb_vmemmap_restore_folio().
> 
> Could you consider adding some diff like below?

Thanks!  That case was indeed overlooked.

Andrew, this patch is currently in mm-stable.  How would you like to update?
- A new version of the patch
- An patch to the original patch
- Something else

-- 
Mike Kravetz

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2312,15 +2312,16 @@ int dissolve_free_huge_page(struct page *page)
>  		 * Attempt to allocate vmemmmap here so that we can take
>  		 * appropriate action on failure.
>  		 */
> -		rc = hugetlb_vmemmap_restore_folio(h, folio);
> -		if (!rc) {
> -			update_and_free_hugetlb_folio(h, folio, false);
> -		} else {
> -			spin_lock_irq(&hugetlb_lock);
> -			add_hugetlb_folio(h, folio, false);
> -			h->max_huge_pages++;
> -			spin_unlock_irq(&hugetlb_lock);
> +		if (folio_test_hugetlb(folio)) {
> +			rc = hugetlb_vmemmap_restore_folio(h, folio);
> +			if (rc) {
> +				spin_lock_irq(&hugetlb_lock);
> +				add_hugetlb_folio(h, folio, false);
> +				h->max_huge_pages++;
> +				goto out;
> +			}
>  		}
> +		update_and_free_hugetlb_folio(h, folio, false);
>  
>  		return rc;
>  	}
> 
> 
> Thanks,
> Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-10-13 21:43     ` Mike Kravetz
@ 2023-10-16 22:55       ` Andrew Morton
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2023-10-16 22:55 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, linux-mm, linux-kernel, Muchun Song,
	Joao Martins, Oscar Salvador, David Hildenbrand, Miaohe Lin,
	David Rientjes, Anshuman Khandual, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan

On Fri, 13 Oct 2023 14:43:56 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > Could you consider adding some diff like below?
> 
> Thanks!  That case was indeed overlooked.
> 
> Andrew, this patch is currently in mm-stable.  How would you like to update?
> - A new version of the patch
> - An patch to the original patch
> - Something else

I guess just a fixup against what's currently in mm-stable, please.  It
happens sometimes.

Please let's get the Fixes: accurate.  I just had to rebase mm-stable
to drop Huang Ying's pcp auto-tuning series.  I'm now seeing

d8f5f7e445f0 hugetlb: set hugetlb page flag before optimizing vmemmap

which I think is what it used to be, so the rebasing shouldn't affect
this patch's hash.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-10-13 12:58   ` Naoya Horiguchi
  2023-10-13 21:43     ` Mike Kravetz
@ 2023-10-17  3:21     ` Mike Kravetz
  2023-10-18  1:58       ` Naoya Horiguchi
  1 sibling, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2023-10-17  3:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 10/13/23 21:58, Naoya Horiguchi wrote:
> On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> > 
> > Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> I saw that VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
> memory_failure() is called on a free hugetlb page with vmemmap optimization
> disabled (the warning is not triggered if vmemmap optimization is enabled).
> > I think that we need to check folio_test_hugetlb() before dissolve_free_huge_page()
> calls hugetlb_vmemmap_restore_folio().
> 
> Could you consider adding some diff like below?
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2312,15 +2312,16 @@ int dissolve_free_huge_page(struct page *page)
>  		 * Attempt to allocate vmemmmap here so that we can take
>  		 * appropriate action on failure.
>  		 */
> -		rc = hugetlb_vmemmap_restore_folio(h, folio);
> -		if (!rc) {
> -			update_and_free_hugetlb_folio(h, folio, false);
> -		} else {
> -			spin_lock_irq(&hugetlb_lock);
> -			add_hugetlb_folio(h, folio, false);
> -			h->max_huge_pages++;
> -			spin_unlock_irq(&hugetlb_lock);
> +		if (folio_test_hugetlb(folio)) {
> +			rc = hugetlb_vmemmap_restore_folio(h, folio);
> +			if (rc) {
> +				spin_lock_irq(&hugetlb_lock);
> +				add_hugetlb_folio(h, folio, false);
> +				h->max_huge_pages++;
> +				goto out;
> +			}
>  		}
> +		update_and_free_hugetlb_folio(h, folio, false);
>  
>  		return rc;
>  	}
> 

Hi Naoya,

I believe we need to set 'rc = 0' in the !folio_test_hugetlb() case.  I put
together the following patch based on mm-stable.  Please take a look.

From f19fbfab324d7d17de4a1e814f95ee655950c58e Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 16 Oct 2023 19:55:49 -0700
Subject: [PATCH] hugetlb: check for hugetlb folio before vmemmap_restore

In commit d8f5f7e445f0 ("hugetlb: set hugetlb page flag before
optimizing vmemmap") checks were added to print a warning if
hugetlb_vmemmap_restore was called on a non-hugetlb page.  This
was mostly due to ordering issues in the hugetlb page set up and
tear down sequences.  One place missed was the routine
dissolve_free_huge_page.  Naoya Horiguchi noted: "I saw that
VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
memory_failure() is called on a free hugetlb page with vmemmap
optimization disabled (the warning is not triggered if vmemmap
optimization is enabled).  I think that we need to check
folio_test_hugetlb() before dissolve_free_huge_page() calls
hugetlb_vmemmap_restore_folio()."

Perform the check as suggested by Naoya.

Fixes: d8f5f7e445f0 ("hugetlb: set hugetlb page flag before optimizing vmemmap")
Suggested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36b40bc9ac25..13736cbb2c19 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2290,17 +2290,23 @@ int dissolve_free_huge_page(struct page *page)
 		 * need to adjust max_huge_pages if the page is not freed.
 		 * Attempt to allocate vmemmmap here so that we can take
 		 * appropriate action on failure.
+		 *
+		 * The folio_test_hugetlb check here is because
+		 * remove_hugetlb_folio will clear hugetlb folio flag for
+		 * non-vmemmap optimized hugetlb folios.
 		 */
-		rc = hugetlb_vmemmap_restore(h, &folio->page);
-		if (!rc) {
-			update_and_free_hugetlb_folio(h, folio, false);
-		} else {
-			spin_lock_irq(&hugetlb_lock);
-			add_hugetlb_folio(h, folio, false);
-			h->max_huge_pages++;
-			spin_unlock_irq(&hugetlb_lock);
-		}
+		if (folio_test_hugetlb(folio)) {
+			rc = hugetlb_vmemmap_restore(h, &folio->page);
+			if (rc) {
+				spin_lock_irq(&hugetlb_lock);
+				add_hugetlb_folio(h, folio, false);
+				h->max_huge_pages++;
+				goto out;
+			}
+		} else
+			rc = 0;
 
+		update_and_free_hugetlb_folio(h, folio, false);
 		return rc;
 	}
 out:
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-10-17  3:21     ` Mike Kravetz
@ 2023-10-18  1:58       ` Naoya Horiguchi
  2023-10-18  3:43         ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2023-10-18  1:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On Mon, Oct 16, 2023 at 08:21:40PM -0700, Mike Kravetz wrote:
> On 10/13/23 21:58, Naoya Horiguchi wrote:
> > On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> > > 
> > > Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > 
> > I saw that VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
> > memory_failure() is called on a free hugetlb page with vmemmap optimization
> > disabled (the warning is not triggered if vmemmap optimization is enabled).
> > > I think that we need to check folio_test_hugetlb() before dissolve_free_huge_page()
> > calls hugetlb_vmemmap_restore_folio().
> > 
> > Could you consider adding some diff like below?
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2312,15 +2312,16 @@ int dissolve_free_huge_page(struct page *page)
> >  		 * Attempt to allocate vmemmmap here so that we can take
> >  		 * appropriate action on failure.
> >  		 */
> > -		rc = hugetlb_vmemmap_restore_folio(h, folio);
> > -		if (!rc) {
> > -			update_and_free_hugetlb_folio(h, folio, false);
> > -		} else {
> > -			spin_lock_irq(&hugetlb_lock);
> > -			add_hugetlb_folio(h, folio, false);
> > -			h->max_huge_pages++;
> > -			spin_unlock_irq(&hugetlb_lock);
> > +		if (folio_test_hugetlb(folio)) {
> > +			rc = hugetlb_vmemmap_restore_folio(h, folio);
> > +			if (rc) {
> > +				spin_lock_irq(&hugetlb_lock);
> > +				add_hugetlb_folio(h, folio, false);
> > +				h->max_huge_pages++;
> > +				goto out;
> > +			}
> >  		}
> > +		update_and_free_hugetlb_folio(h, folio, false);
> >  
> >  		return rc;
> >  	}
> > 
> 
> Hi Naoya,
> 
> I believe we need to set 'rc = 0' in the !folio_test_hugetlb() case.  I put
> together the following patch based on mm-stable.  Please take a look.

Hi Mike, the patch looks good to me, and I confirmed that my test programs
showed no new problem on latest mm-stable + this patch.

Thank you very much.
Naoya Horiguchi

> 
> From f19fbfab324d7d17de4a1e814f95ee655950c58e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Mon, 16 Oct 2023 19:55:49 -0700
> Subject: [PATCH] hugetlb: check for hugetlb folio before vmemmap_restore
> 
> In commit d8f5f7e445f0 ("hugetlb: set hugetlb page flag before
> optimizing vmemmap") checks were added to print a warning if
> hugetlb_vmemmap_restore was called on a non-hugetlb page.  This
> was mostly due to ordering issues in the hugetlb page set up and
> tear down sequences.  One place missed was the routine
> dissolve_free_huge_page.  Naoya Horiguchi noted: "I saw that
> VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
> memory_failure() is called on a free hugetlb page with vmemmap
> optimization disabled (the warning is not triggered if vmemmap
> optimization is enabled).  I think that we need to check
> folio_test_hugetlb() before dissolve_free_huge_page() calls
> hugetlb_vmemmap_restore_folio()."
> 
> Perform the check as suggested by Naoya.
> 
> Fixes: d8f5f7e445f0 ("hugetlb: set hugetlb page flag before optimizing vmemmap")
> Suggested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 36b40bc9ac25..13736cbb2c19 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2290,17 +2290,23 @@ int dissolve_free_huge_page(struct page *page)
>  		 * need to adjust max_huge_pages if the page is not freed.
>  		 * Attempt to allocate vmemmmap here so that we can take
>  		 * appropriate action on failure.
> +		 *
> +		 * The folio_test_hugetlb check here is because
> +		 * remove_hugetlb_folio will clear hugetlb folio flag for
> +		 * non-vmemmap optimized hugetlb folios.
>  		 */
> -		rc = hugetlb_vmemmap_restore(h, &folio->page);
> -		if (!rc) {
> -			update_and_free_hugetlb_folio(h, folio, false);
> -		} else {
> -			spin_lock_irq(&hugetlb_lock);
> -			add_hugetlb_folio(h, folio, false);
> -			h->max_huge_pages++;
> -			spin_unlock_irq(&hugetlb_lock);
> -		}
> +		if (folio_test_hugetlb(folio)) {
> +			rc = hugetlb_vmemmap_restore(h, &folio->page);
> +			if (rc) {
> +				spin_lock_irq(&hugetlb_lock);
> +				add_hugetlb_folio(h, folio, false);
> +				h->max_huge_pages++;
> +				goto out;
> +			}
> +		} else
> +			rc = 0;
>  
> +		update_and_free_hugetlb_folio(h, folio, false);
>  		return rc;
>  	}
>  out:
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap
  2023-10-18  1:58       ` Naoya Horiguchi
@ 2023-10-18  3:43         ` Mike Kravetz
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2023-10-18  3:43 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins,
	Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
	Anshuman Khandual, Barry Song, Michal Hocko, Matthew Wilcox,
	Xiongchun Duan, Andrew Morton

On 10/18/23 10:58, Naoya Horiguchi wrote:
> On Mon, Oct 16, 2023 at 08:21:40PM -0700, Mike Kravetz wrote:
> > On 10/13/23 21:58, Naoya Horiguchi wrote:
> > > On Tue, Sep 05, 2023 at 02:44:00PM -0700, Mike Kravetz wrote:
> > Hi Naoya,
> > 
> > I believe we need to set 'rc = 0' in the !folio_test_hugetlb() case.  I put
> > together the following patch based on mm-stable.  Please take a look.
> 
> Hi Mike, the patch looks good to me, and I confirmed that my test programs
> showed no new problem on latest mm-stable + this patch.
> 

Thank you for testing!

My patch could be simpler/more elegant by eliminating that else clause
and just returning 0 after update_and_free_hugetlb_folio.  Oh well, it
is functionally the same.
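
i.e. the tail of dissolve_free_huge_page() would read (untested, behavior
unchanged from the patch above):

		if (folio_test_hugetlb(folio)) {
			rc = hugetlb_vmemmap_restore(h, &folio->page);
			if (rc) {
				spin_lock_irq(&hugetlb_lock);
				add_hugetlb_folio(h, folio, false);
				h->max_huge_pages++;
				goto out;
			}
		}

		update_and_free_hugetlb_folio(h, folio, false);
		return 0;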
-- 
Mike Kravetz


> Thank you very much.
> Naoya Horiguchi
> 
> > 
> > From f19fbfab324d7d17de4a1e814f95ee655950c58e Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <mike.kravetz@oracle.com>
> > Date: Mon, 16 Oct 2023 19:55:49 -0700
> > Subject: [PATCH] hugetlb: check for hugetlb folio before vmemmap_restore
> > 
> > In commit d8f5f7e445f0 ("hugetlb: set hugetlb page flag before
> > optimizing vmemmap") checks were added to print a warning if
> > hugetlb_vmemmap_restore was called on a non-hugetlb page.  This
> > was mostly due to ordering issues in the hugetlb page set up and
> > tear down sequences.  One place missed was the routine
> > dissolve_free_huge_page.  Naoya Horiguchi noted: "I saw that
> > VM_WARN_ON_ONCE() in hugetlb_vmemmap_restore is triggered when
> > memory_failure() is called on a free hugetlb page with vmemmap
> > optimization disabled (the warning is not triggered if vmemmap
> > optimization is enabled).  I think that we need to check
> > folio_test_hugetlb() before dissolve_free_huge_page() calls
> > hugetlb_vmemmap_restore_folio()."
> > 
> > Perform the check as suggested by Naoya.
> > 
> > Fixes: d8f5f7e445f0 ("hugetlb: set hugetlb page flag before optimizing vmemmap")
> > Suggested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  mm/hugetlb.c | 24 +++++++++++++++---------
> >  1 file changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 36b40bc9ac25..13736cbb2c19 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2290,17 +2290,23 @@ int dissolve_free_huge_page(struct page *page)
> >  		 * need to adjust max_huge_pages if the page is not freed.
> >  		 * Attempt to allocate vmemmmap here so that we can take
> >  		 * appropriate action on failure.
> > +		 *
> > +		 * The folio_test_hugetlb check here is because
> > +		 * remove_hugetlb_folio will clear hugetlb folio flag for
> > +		 * non-vmemmap optimized hugetlb folios.
> >  		 */
> > -		rc = hugetlb_vmemmap_restore(h, &folio->page);
> > -		if (!rc) {
> > -			update_and_free_hugetlb_folio(h, folio, false);
> > -		} else {
> > -			spin_lock_irq(&hugetlb_lock);
> > -			add_hugetlb_folio(h, folio, false);
> > -			h->max_huge_pages++;
> > -			spin_unlock_irq(&hugetlb_lock);
> > -		}
> > +		if (folio_test_hugetlb(folio)) {
> > +			rc = hugetlb_vmemmap_restore(h, &folio->page);
> > +			if (rc) {
> > +				spin_lock_irq(&hugetlb_lock);
> > +				add_hugetlb_folio(h, folio, false);
> > +				h->max_huge_pages++;
> > +				goto out;
> > +			}
> > +		} else
> > +			rc = 0;
> >  
> > +		update_and_free_hugetlb_folio(h, folio, false);
> >  		return rc;
> >  	}
> >  out:
> > -- 
> > 2.41.0

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2023-10-18  3:44 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-05 21:43 [PATCH v2 00/11] Batch hugetlb vmemmap modification operations Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 01/11] hugetlb: set hugetlb page flag before optimizing vmemmap Mike Kravetz
2023-09-06  0:48   ` Matthew Wilcox
2023-09-06  1:05     ` Mike Kravetz
2023-10-13 12:58   ` Naoya Horiguchi
2023-10-13 21:43     ` Mike Kravetz
2023-10-16 22:55       ` Andrew Morton
2023-10-17  3:21     ` Mike Kravetz
2023-10-18  1:58       ` Naoya Horiguchi
2023-10-18  3:43         ` Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 02/11] hugetlb: Use a folio in free_hpage_workfn() Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 03/11] hugetlb: Remove a few calls to page_folio() Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 04/11] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio() Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 05/11] hugetlb: restructure pool allocations Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 06/11] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
2023-09-06  7:30   ` Muchun Song
2023-09-05 21:44 ` [PATCH v2 07/11] hugetlb: perform vmemmap restoration " Mike Kravetz
2023-09-06  7:33   ` Muchun Song
2023-09-06  8:07     ` Muchun Song
2023-09-06 21:12       ` Mike Kravetz
2023-09-07  3:33         ` Muchun Song
2023-09-07 18:54           ` Mike Kravetz
2023-09-08 20:53             ` Mike Kravetz
2023-09-11  3:10               ` Muchun Song
2023-09-06 20:53     ` Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 08/11] hugetlb: batch freeing of vmemmap pages Mike Kravetz
2023-09-06  7:38   ` Muchun Song
2023-09-06 21:38     ` Mike Kravetz
2023-09-07  6:19       ` Muchun Song
2023-09-07 18:47         ` Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
2023-09-06  8:24   ` Muchun Song
2023-09-06  9:11     ` [External] " Muchun Song
2023-09-06  9:26       ` Joao Martins
2023-09-06  9:32         ` [External] " Muchun Song
2023-09-06  9:44           ` Joao Martins
2023-09-06 11:34             ` Muchun Song
2023-09-06  9:13     ` Joao Martins
2023-09-05 21:44 ` [PATCH v2 10/11] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
2023-09-07  6:55   ` Muchun Song
2023-09-07 18:57     ` Mike Kravetz
2023-09-05 21:44 ` [PATCH v2 11/11] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
2023-09-07  6:58   ` Muchun Song
2023-09-07 18:58     ` Mike Kravetz
