* [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability
@ 2022-06-13  6:35 Muchun Song
  2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song

This series aims to simplify hugetlb vmemmap and improve its readability
and is based on next-20220610.

Muchun Song (6):
  mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()
  mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling
  mm: hugetlb_vmemmap: introduce the name HVO
  mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to
    hugetlb_vmemmap.c
  mm: hugetlb_vmemmap: replace early_param() with core_param()
  mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability

 Documentation/admin-guide/kernel-parameters.txt |   7 +-
 Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +-
 Documentation/admin-guide/sysctl/vm.rst         |   3 +-
 arch/arm64/mm/flush.c                           |  13 +-
 fs/Kconfig                                      |  13 +-
 include/linux/hugetlb.h                         |   7 +-
 include/linux/mm.h                              |   7 -
 include/linux/page-flags.h                      |  16 +-
 mm/hugetlb.c                                    |  11 +-
 mm/hugetlb_vmemmap.c                            | 592 ++++++++++++++++++------
 mm/hugetlb_vmemmap.h                            |  43 +-
 mm/sparse-vmemmap.c                             | 391 ----------------
 12 files changed, 509 insertions(+), 597 deletions(-)


base-commit: 6d0c806803170f120f8cb97b321de7bd89d3a791
-- 
2.11.0



* [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()
  2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
@ 2022-06-13  6:35 ` Muchun Song
  2022-06-13  8:04   ` Oscar Salvador
                     ` (2 more replies)
  2022-06-13  6:35 ` [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling Muchun Song
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song, Catalin Marinas,
	Will Deacon, Anshuman Khandual

The name hugetlb_optimize_vmemmap_enabled() is a bit confusing as it
tests two conditions (the feature is enabled and pages are in use).
Instead of coming up with a more appropriate name, we can simply delete
it.  There is already a discussion about deleting it in thread [1].

There is only one user of hugetlb_optimize_vmemmap_enabled() outside of
hugetlb_vmemmap, namely flush_dcache_page() in arch/arm64/mm/flush.c.
However, flush_dcache_page() does not need to call it, since HugeTLB
pages are always fully mapped and only the head page is set
PG_dcache_clean, meaning only the head page's flag may need to be
cleared (see commit cf5a501d985b).  So it is easy to remove
hugetlb_optimize_vmemmap_enabled().

Link: https://lore.kernel.org/all/c77c61c8-8a5a-87e8-db89-d04d8aaab4cc@oracle.com/ [1]
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/flush.c      | 13 +++----------
 include/linux/page-flags.h | 14 ++------------
 2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index fc4f710e9820..5f9379b3c8c8 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -76,17 +76,10 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
 void flush_dcache_page(struct page *page)
 {
 	/*
-	 * Only the head page's flags of HugeTLB can be cleared since the tail
-	 * vmemmap pages associated with each HugeTLB page are mapped with
-	 * read-only when CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is enabled (more
-	 * details can refer to vmemmap_remap_pte()).  Although
-	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
-	 * page struct, there is more than one page struct with PG_dcache_clean
-	 * associated with the HugeTLB page since the head vmemmap page frame
-	 * is reused (more details can refer to the comments above
-	 * page_fixed_fake_head()).
+	 * HugeTLB pages are always fully mapped and only head page will be
+	 * set PG_dcache_clean (see comments in __sync_icache_dcache()).
 	 */
-	if (hugetlb_optimize_vmemmap_enabled() && PageHuge(page))
+	if (PageHuge(page))
 		page = compound_head(page);
 
 	if (test_bit(PG_dcache_clean, &page->flags))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index de80f0c26b2f..b8b992cb201c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -203,12 +203,6 @@ enum pageflags {
 DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
 			 hugetlb_optimize_vmemmap_key);
 
-static __always_inline bool hugetlb_optimize_vmemmap_enabled(void)
-{
-	return static_branch_maybe(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
-				   &hugetlb_optimize_vmemmap_key);
-}
-
 /*
  * If the feature of optimizing vmemmap pages associated with each HugeTLB
  * page is enabled, the head vmemmap page frame is reused and all of the tail
@@ -227,7 +221,8 @@ static __always_inline bool hugetlb_optimize_vmemmap_enabled(void)
  */
 static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
 {
-	if (!hugetlb_optimize_vmemmap_enabled())
+	if (!static_branch_maybe(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
+				 &hugetlb_optimize_vmemmap_key))
 		return page;
 
 	/*
@@ -255,11 +250,6 @@ static inline const struct page *page_fixed_fake_head(const struct page *page)
 {
 	return page;
 }
-
-static inline bool hugetlb_optimize_vmemmap_enabled(void)
-{
-	return false;
-}
 #endif
 
 static __always_inline int page_is_fake_head(struct page *page)
-- 
2.11.0



* [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling
  2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
  2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
@ 2022-06-13  6:35 ` Muchun Song
  2022-06-13  8:10   ` Oscar Salvador
  2022-06-13 18:28   ` Mike Kravetz
  2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song

We hold another reference to hugetlb_optimize_vmemmap_key when setting
vmemmap_optimize_mode to on, because we use the static key to tell
memory_hotplug that memory_hotplug.memmap_on_memory should be
overridden.  However, this rule is gone since the introduction of
SECTION_CANNOT_OPTIMIZE_VMEMMAP.  Therefore, we can simplify
vmemmap_optimize_mode handling by not holding an extra reference to
hugetlb_optimize_vmemmap_key, replacing the mode enum with a plain
boolean.
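
The result is, roughly, a plain boolean which both the boot parameter
and the sysctl operate on directly (a sketch mirroring the diff below,
not a literal copy of it):

  static bool vmemmap_optimize_enabled =
  	IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);

  /* "hugetlb_free_vmemmap=on|off" simply parses into the boolean. */
  static int __init hugetlb_vmemmap_early_param(char *buf)
  {
  	return kstrtobool(buf, &vmemmap_optimize_enabled);
  }

  /* The optimization path (optimizable_vmemmap_pages()) only reads it. */
  if (!READ_ONCE(vmemmap_optimize_enabled))
  	return 0;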

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/page-flags.h |  6 ++---
 mm/hugetlb_vmemmap.c       | 65 +++++-----------------------------------------
 2 files changed, 9 insertions(+), 62 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b8b992cb201c..da7ccc3b16ad 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -200,8 +200,7 @@ enum pageflags {
 #ifndef __GENERATING_BOUNDS_H
 
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
-			 hugetlb_optimize_vmemmap_key);
+DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 
 /*
  * If the feature of optimizing vmemmap pages associated with each HugeTLB
@@ -221,8 +220,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
  */
 static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
 {
-	if (!static_branch_maybe(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
-				 &hugetlb_optimize_vmemmap_key))
+	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
 		return page;
 
 	/*
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e20a7082f2f8..132dc83f0130 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -23,42 +23,15 @@
 #define RESERVE_VMEMMAP_NR		1U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
-enum vmemmap_optimize_mode {
-	VMEMMAP_OPTIMIZE_OFF,
-	VMEMMAP_OPTIMIZE_ON,
-};
-
-DEFINE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
-			hugetlb_optimize_vmemmap_key);
+DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
 
-static enum vmemmap_optimize_mode vmemmap_optimize_mode =
+static bool vmemmap_optimize_enabled =
 	IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
 
-static void vmemmap_optimize_mode_switch(enum vmemmap_optimize_mode to)
-{
-	if (vmemmap_optimize_mode == to)
-		return;
-
-	if (to == VMEMMAP_OPTIMIZE_OFF)
-		static_branch_dec(&hugetlb_optimize_vmemmap_key);
-	else
-		static_branch_inc(&hugetlb_optimize_vmemmap_key);
-	WRITE_ONCE(vmemmap_optimize_mode, to);
-}
-
 static int __init hugetlb_vmemmap_early_param(char *buf)
 {
-	bool enable;
-	enum vmemmap_optimize_mode mode;
-
-	if (kstrtobool(buf, &enable))
-		return -EINVAL;
-
-	mode = enable ? VMEMMAP_OPTIMIZE_ON : VMEMMAP_OPTIMIZE_OFF;
-	vmemmap_optimize_mode_switch(mode);
-
-	return 0;
+	return kstrtobool(buf, &vmemmap_optimize_enabled);
 }
 early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);
 
@@ -103,7 +76,7 @@ static unsigned int optimizable_vmemmap_pages(struct hstate *h,
 	unsigned long pfn = page_to_pfn(head);
 	unsigned long end = pfn + pages_per_huge_page(h);
 
-	if (READ_ONCE(vmemmap_optimize_mode) == VMEMMAP_OPTIMIZE_OFF)
+	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return 0;
 
 	for (; pfn < end; pfn += PAGES_PER_SECTION) {
@@ -155,7 +128,6 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 
 	if (!is_power_of_2(sizeof(struct page))) {
 		pr_warn_once("cannot optimize vmemmap pages because \"struct page\" crosses page boundaries\n");
-		static_branch_disable(&hugetlb_optimize_vmemmap_key);
 		return;
 	}
 
@@ -176,36 +148,13 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 }
 
 #ifdef CONFIG_PROC_SYSCTL
-static int hugetlb_optimize_vmemmap_handler(struct ctl_table *table, int write,
-					    void *buffer, size_t *length,
-					    loff_t *ppos)
-{
-	int ret;
-	enum vmemmap_optimize_mode mode;
-	static DEFINE_MUTEX(sysctl_mutex);
-
-	if (write && !capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	mutex_lock(&sysctl_mutex);
-	mode = vmemmap_optimize_mode;
-	table->data = &mode;
-	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
-	if (write && !ret)
-		vmemmap_optimize_mode_switch(mode);
-	mutex_unlock(&sysctl_mutex);
-
-	return ret;
-}
-
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
 		.procname	= "hugetlb_optimize_vmemmap",
-		.maxlen		= sizeof(enum vmemmap_optimize_mode),
+		.data		= &vmemmap_optimize_enabled,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= hugetlb_optimize_vmemmap_handler,
-		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.proc_handler	= proc_dobool,
 	},
 	{ }
 };
-- 
2.11.0



* [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
  2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
  2022-06-13  6:35 ` [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling Muchun Song
@ 2022-06-13  6:35 ` Muchun Song
  2022-06-13  8:13   ` Oscar Salvador
                     ` (3 more replies)
  2022-06-13  6:35 ` [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c Muchun Song
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song

It is inconvenient to refer to the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others, since no
specific or abbreviated name was given to it when it was first
introduced.  Let us call it HVO (HugeTLB Vmemmap Optimization) from now
on.

This commit also updates the documentation of "hugetlb_free_vmemmap" in
the way discussed in thread [1].

Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 ++++---
 Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +--
 Documentation/admin-guide/sysctl/vm.rst         |  3 +--
 fs/Kconfig                                      | 13 ++++++-------
 mm/hugetlb_vmemmap.c                            |  8 ++++----
 mm/hugetlb_vmemmap.h                            |  4 ++--
 6 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 391b43fee93e..7539553b3fb0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1725,12 +1725,13 @@
 	hugetlb_free_vmemmap=
 			[KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 			enabled.
+			Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
 			Allows heavy hugetlb users to free up some more
 			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
-			Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
+			Format: { on | off (default) }
 
-			[oO][Nn]/Y/y/1: enable the feature
-			[oO][Ff]/N/n/0: disable the feature
+			on: enable HVO
+			off: disable HVO
 
 			Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
 			the default is on.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index a90330d0a837..64e0d5c512e7 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -164,8 +164,7 @@ default_hugepagesz
 	will all result in 256 2M huge pages being allocated.  Valid default
 	huge page size is architecture dependent.
 hugetlb_free_vmemmap
-	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
-	unused vmemmap pages associated with each HugeTLB page.
+	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HVO.
 
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index d7374a1e8ac9..c9f35db973f0 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -569,8 +569,7 @@ This knob is not available when the size of 'struct page' (a structure defined
 in include/linux/mm_types.h) is not power of two (an unusual system config could
 result in this).
 
-Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
-associated with each HugeTLB page.
+Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
 
 Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
 buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
diff --git a/fs/Kconfig b/fs/Kconfig
index 5976eb33535f..2f9fd840cb66 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -247,8 +247,7 @@ config HUGETLB_PAGE
 
 #
 # Select this config option from the architecture Kconfig, if it is preferred
-# to enable the feature of minimizing overhead of struct page associated with
-# each HugeTLB page.
+# to enable the feature of HugeTLB Vmemmap Optimization (HVO).
 #
 config ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	bool
@@ -259,14 +258,14 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	depends on SPARSEMEM_VMEMMAP
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
-	bool "Default optimizing vmemmap pages of HugeTLB to on"
+	bool "Default HugeTLB Vmemmap Optimization (HVO) to on"
 	default n
 	depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	help
-	  When using HUGETLB_PAGE_OPTIMIZE_VMEMMAP, the optimizing unused vmemmap
-	  pages associated with each HugeTLB page is default off. Say Y here
-	  to enable optimizing vmemmap pages of HugeTLB by default. It can then
-	  be disabled on the command line via hugetlb_free_vmemmap=off.
+	  When using HUGETLB_PAGE_OPTIMIZE_VMEMMAP, the HugeTLB Vmemmap
+	  Optimization (HVO) is off by default. Say Y here to enable HVO
+	  by default. It can then be disabled on the command line via
+	  hugetlb_free_vmemmap=off or sysctl.
 
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 132dc83f0130..c10540993577 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -1,8 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Optimize vmemmap pages associated with HugeTLB
+ * HugeTLB Vmemmap Optimization (HVO)
  *
- * Copyright (c) 2020, Bytedance. All rights reserved.
+ * Copyright (c) 2020, ByteDance. All rights reserved.
  *
  *     Author: Muchun Song <songmuchun@bytedance.com>
  *
@@ -120,8 +120,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 
 	/*
 	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
-	 * page structs that can be used when CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP,
-	 * so add a BUILD_BUG_ON to catch invalid usage of the tail struct page.
+	 * page structs that can be used when HVO is enabled, add a BUILD_BUG_ON
+	 * to catch invalid usage of the tail page structs.
 	 */
 	BUILD_BUG_ON(__NR_USED_SUBPAGE >=
 		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 109b0a53b6fe..ba66fadad9fc 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -1,8 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Optimize vmemmap pages associated with HugeTLB
+ * HugeTLB Vmemmap Optimization (HVO)
  *
- * Copyright (c) 2020, Bytedance. All rights reserved.
+ * Copyright (c) 2020, ByteDance. All rights reserved.
  *
  *     Author: Muchun Song <songmuchun@bytedance.com>
  */
-- 
2.11.0



* [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
                   ` (2 preceding siblings ...)
  2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
@ 2022-06-13  6:35 ` Muchun Song
  2022-06-13  8:15   ` Oscar Salvador
  2022-06-13 21:34   ` Mike Kravetz
  2022-06-13  6:35 ` [PATCH 5/6] mm: hugetlb_vmemmap: replace early_param() with core_param() Muchun Song
  2022-06-13  6:35 ` [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability Muchun Song
  5 siblings, 2 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song

When I first introduced the vmemmap manipulation functions related to
HugeTLB, I thought they might be reused by other modules taking a
similar approach to optimize vmemmap pages (unfortunately, DAX uses the
same approach but does not use those functions).  After two years, no
other users have appeared.  So move those functions to
hugetlb_vmemmap.c.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mm.h   |   7 -
 mm/hugetlb_vmemmap.c | 391 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/sparse-vmemmap.c  | 391 ---------------------------------------------------
 3 files changed, 390 insertions(+), 399 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 623c2ee8330a..152d0eefe5aa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3208,13 +3208,6 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 }
 #endif
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int vmemmap_remap_free(unsigned long start, unsigned long end,
-		       unsigned long reuse);
-int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			unsigned long reuse, gfp_t gfp_mask);
-#endif
-
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c10540993577..abdf441215bb 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -10,9 +10,31 @@
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
-#include <linux/memory_hotplug.h>
+#include <linux/pgtable.h>
+#include <linux/bootmem_info.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
 
+/**
+ * struct vmemmap_remap_walk - walk vmemmap page table
+ *
+ * @remap_pte:		called for each lowest-level entry (PTE).
+ * @nr_walked:		the number of walked pte.
+ * @reuse_page:		the page which is reused for the tail vmemmap pages.
+ * @reuse_addr:		the virtual address of the @reuse_page page.
+ * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
+ *			or is mapped from.
+ */
+struct vmemmap_remap_walk {
+	void			(*remap_pte)(pte_t *pte, unsigned long addr,
+					     struct vmemmap_remap_walk *walk);
+	unsigned long		nr_walked;
+	struct page		*reuse_page;
+	unsigned long		reuse_addr;
+	struct list_head	*vmemmap_pages;
+};
+
 /*
  * There are a lot of struct page structures associated with each HugeTLB page.
  * For tail pages, the value of compound_head is the same. So we can reuse first
@@ -23,6 +45,373 @@
 #define RESERVE_VMEMMAP_NR		1U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
+static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+{
+	pmd_t __pmd;
+	int i;
+	unsigned long addr = start;
+	struct page *page = pmd_page(*pmd);
+	pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
+
+	if (!pgtable)
+		return -ENOMEM;
+
+	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
+
+	for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
+		pte_t entry, *pte;
+		pgprot_t pgprot = PAGE_KERNEL;
+
+		entry = mk_pte(page + i, pgprot);
+		pte = pte_offset_kernel(&__pmd, addr);
+		set_pte_at(&init_mm, addr, pte, entry);
+	}
+
+	spin_lock(&init_mm.page_table_lock);
+	if (likely(pmd_leaf(*pmd))) {
+		/* Make pte visible before pmd. See comment in pmd_install(). */
+		smp_wmb();
+		pmd_populate_kernel(&init_mm, pmd, pgtable);
+		flush_tlb_kernel_range(start, start + PMD_SIZE);
+	} else {
+		pte_free_kernel(&init_mm, pgtable);
+	}
+	spin_unlock(&init_mm.page_table_lock);
+
+	return 0;
+}
+
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+{
+	int leaf;
+
+	spin_lock(&init_mm.page_table_lock);
+	leaf = pmd_leaf(*pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	if (!leaf)
+		return 0;
+
+	return __split_vmemmap_huge_pmd(pmd, start);
+}
+
+static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
+			      unsigned long end,
+			      struct vmemmap_remap_walk *walk)
+{
+	pte_t *pte = pte_offset_kernel(pmd, addr);
+
+	/*
+	 * The reuse_page is found 'first' in table walk before we start
+	 * remapping (which is calling @walk->remap_pte).
+	 */
+	if (!walk->reuse_page) {
+		walk->reuse_page = pte_page(*pte);
+		/*
+		 * Because the reuse address is part of the range that we are
+		 * walking, skip the reuse address range.
+		 */
+		addr += PAGE_SIZE;
+		pte++;
+		walk->nr_walked++;
+	}
+
+	for (; addr != end; addr += PAGE_SIZE, pte++) {
+		walk->remap_pte(pte, addr, walk);
+		walk->nr_walked++;
+	}
+}
+
+static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
+			     unsigned long end,
+			     struct vmemmap_remap_walk *walk)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		int ret;
+
+		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
+		if (ret)
+			return ret;
+
+		next = pmd_addr_end(addr, end);
+		vmemmap_pte_range(pmd, addr, next, walk);
+	} while (pmd++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
+			     unsigned long end,
+			     struct vmemmap_remap_walk *walk)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(p4d, addr);
+	do {
+		int ret;
+
+		next = pud_addr_end(addr, end);
+		ret = vmemmap_pmd_range(pud, addr, next, walk);
+		if (ret)
+			return ret;
+	} while (pud++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
+			     unsigned long end,
+			     struct vmemmap_remap_walk *walk)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		int ret;
+
+		next = p4d_addr_end(addr, end);
+		ret = vmemmap_pud_range(p4d, addr, next, walk);
+		if (ret)
+			return ret;
+	} while (p4d++, addr = next, addr != end);
+
+	return 0;
+}
+
+static int vmemmap_remap_range(unsigned long start, unsigned long end,
+			       struct vmemmap_remap_walk *walk)
+{
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgd;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+
+	pgd = pgd_offset_k(addr);
+	do {
+		int ret;
+
+		next = pgd_addr_end(addr, end);
+		ret = vmemmap_p4d_range(pgd, addr, next, walk);
+		if (ret)
+			return ret;
+	} while (pgd++, addr = next, addr != end);
+
+	/*
+	 * We only change the mapping of the vmemmap virtual address range
+	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
+	 * belongs to the range.
+	 */
+	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+
+	return 0;
+}
+
+/*
+ * Free a vmemmap page. A vmemmap page can be allocated from the memblock
+ * allocator or buddy allocator. If the PG_reserved flag is set, it means
+ * that it allocated from the memblock allocator, just free it via the
+ * free_bootmem_page(). Otherwise, use __free_page().
+ */
+static inline void free_vmemmap_page(struct page *page)
+{
+	if (PageReserved(page))
+		free_bootmem_page(page);
+	else
+		__free_page(page);
+}
+
+/* Free a list of the vmemmap pages */
+static void free_vmemmap_page_list(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		list_del(&page->lru);
+		free_vmemmap_page(page);
+	}
+}
+
+static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
+			      struct vmemmap_remap_walk *walk)
+{
+	/*
+	 * Remap the tail pages as read-only to catch illegal write operation
+	 * to the tail pages.
+	 */
+	pgprot_t pgprot = PAGE_KERNEL_RO;
+	pte_t entry = mk_pte(walk->reuse_page, pgprot);
+	struct page *page = pte_page(*pte);
+
+	list_add_tail(&page->lru, walk->vmemmap_pages);
+	set_pte_at(&init_mm, addr, pte, entry);
+}
+
+/*
+ * How many struct page structs need to be reset. When we reuse the head
+ * struct page, the special metadata (e.g. page->flags or page->mapping)
+ * cannot copy to the tail struct page structs. The invalid value will be
+ * checked in the free_tail_pages_check(). In order to avoid the message
+ * of "corrupted mapping in tail page". We need to reset at least 3 (one
+ * head struct page struct and two tail struct page structs) struct page
+ * structs.
+ */
+#define NR_RESET_STRUCT_PAGE		3
+
+static inline void reset_struct_pages(struct page *start)
+{
+	int i;
+	struct page *from = start + NR_RESET_STRUCT_PAGE;
+
+	for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
+		memcpy(start + i, from, sizeof(*from));
+}
+
+static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
+				struct vmemmap_remap_walk *walk)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	struct page *page;
+	void *to;
+
+	BUG_ON(pte_page(*pte) != walk->reuse_page);
+
+	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+	list_del(&page->lru);
+	to = page_to_virt(page);
+	copy_page(to, (void *)walk->reuse_addr);
+	reset_struct_pages(to);
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+}
+
+/**
+ * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
+ *			to the page which @reuse is mapped to, then free vmemmap
+ *			which the range are mapped to.
+ * @start:	start address of the vmemmap virtual address range that we want
+ *		to remap.
+ * @end:	end address of the vmemmap virtual address range that we want to
+ *		remap.
+ * @reuse:	reuse address.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+static int vmemmap_remap_free(unsigned long start, unsigned long end,
+			      unsigned long reuse)
+{
+	int ret;
+	LIST_HEAD(vmemmap_pages);
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= vmemmap_remap_pte,
+		.reuse_addr	= reuse,
+		.vmemmap_pages	= &vmemmap_pages,
+	};
+
+	/*
+	 * In order to make remapping routine most efficient for the huge pages,
+	 * the routine of vmemmap page table walking has the following rules
+	 * (see more details from the vmemmap_pte_range()):
+	 *
+	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
+	 *   should be continuous.
+	 * - The @reuse address is part of the range [@reuse, @end) that we are
+	 *   walking which is passed to vmemmap_remap_range().
+	 * - The @reuse address is the first in the complete range.
+	 *
+	 * So we need to make sure that @start and @reuse meet the above rules.
+	 */
+	BUG_ON(start - reuse != PAGE_SIZE);
+
+	mmap_read_lock(&init_mm);
+	ret = vmemmap_remap_range(reuse, end, &walk);
+	if (ret && walk.nr_walked) {
+		end = reuse + walk.nr_walked * PAGE_SIZE;
+		/*
+		 * vmemmap_pages contains pages from the previous
+		 * vmemmap_remap_range call which failed.  These
+		 * are pages which were removed from the vmemmap.
+		 * They will be restored in the following call.
+		 */
+		walk = (struct vmemmap_remap_walk) {
+			.remap_pte	= vmemmap_restore_pte,
+			.reuse_addr	= reuse,
+			.vmemmap_pages	= &vmemmap_pages,
+		};
+
+		vmemmap_remap_range(reuse, end, &walk);
+	}
+	mmap_read_unlock(&init_mm);
+
+	free_vmemmap_page_list(&vmemmap_pages);
+
+	return ret;
+}
+
+static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
+				   gfp_t gfp_mask, struct list_head *list)
+{
+	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
+	int nid = page_to_nid((struct page *)start);
+	struct page *page, *next;
+
+	while (nr_pages--) {
+		page = alloc_pages_node(nid, gfp_mask, 0);
+		if (!page)
+			goto out;
+		list_add_tail(&page->lru, list);
+	}
+
+	return 0;
+out:
+	list_for_each_entry_safe(page, next, list, lru)
+		__free_pages(page, 0);
+	return -ENOMEM;
+}
+
+/**
+ * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
+ *			 to the page which is from the @vmemmap_pages
+ *			 respectively.
+ * @start:	start address of the vmemmap virtual address range that we want
+ *		to remap.
+ * @end:	end address of the vmemmap virtual address range that we want to
+ *		remap.
+ * @reuse:	reuse address.
+ * @gfp_mask:	GFP flag for allocating vmemmap pages.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
+			       unsigned long reuse, gfp_t gfp_mask)
+{
+	LIST_HEAD(vmemmap_pages);
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= vmemmap_restore_pte,
+		.reuse_addr	= reuse,
+		.vmemmap_pages	= &vmemmap_pages,
+	};
+
+	/* See the comment in the vmemmap_remap_free(). */
+	BUG_ON(start - reuse != PAGE_SIZE);
+
+	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
+		return -ENOMEM;
+
+	mmap_read_lock(&init_mm);
+	vmemmap_remap_range(reuse, end, &walk);
+	mmap_read_unlock(&init_mm);
+
+	return 0;
+}
+
 DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 49cb15cbe590..473effcb2285 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -27,400 +27,9 @@
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
 #include <linux/sched.h>
-#include <linux/pgtable.h>
-#include <linux/bootmem_info.h>
 
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-/**
- * struct vmemmap_remap_walk - walk vmemmap page table
- *
- * @remap_pte:		called for each lowest-level entry (PTE).
- * @nr_walked:		the number of walked pte.
- * @reuse_page:		the page which is reused for the tail vmemmap pages.
- * @reuse_addr:		the virtual address of the @reuse_page page.
- * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
- *			or is mapped from.
- */
-struct vmemmap_remap_walk {
-	void (*remap_pte)(pte_t *pte, unsigned long addr,
-			  struct vmemmap_remap_walk *walk);
-	unsigned long nr_walked;
-	struct page *reuse_page;
-	unsigned long reuse_addr;
-	struct list_head *vmemmap_pages;
-};
-
-static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
-{
-	pmd_t __pmd;
-	int i;
-	unsigned long addr = start;
-	struct page *page = pmd_page(*pmd);
-	pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
-
-	if (!pgtable)
-		return -ENOMEM;
-
-	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
-
-	for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
-		pte_t entry, *pte;
-		pgprot_t pgprot = PAGE_KERNEL;
-
-		entry = mk_pte(page + i, pgprot);
-		pte = pte_offset_kernel(&__pmd, addr);
-		set_pte_at(&init_mm, addr, pte, entry);
-	}
-
-	spin_lock(&init_mm.page_table_lock);
-	if (likely(pmd_leaf(*pmd))) {
-		/* Make pte visible before pmd. See comment in pmd_install(). */
-		smp_wmb();
-		pmd_populate_kernel(&init_mm, pmd, pgtable);
-		flush_tlb_kernel_range(start, start + PMD_SIZE);
-	} else {
-		pte_free_kernel(&init_mm, pgtable);
-	}
-	spin_unlock(&init_mm.page_table_lock);
-
-	return 0;
-}
-
-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
-{
-	int leaf;
-
-	spin_lock(&init_mm.page_table_lock);
-	leaf = pmd_leaf(*pmd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	if (!leaf)
-		return 0;
-
-	return __split_vmemmap_huge_pmd(pmd, start);
-}
-
-static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
-			      unsigned long end,
-			      struct vmemmap_remap_walk *walk)
-{
-	pte_t *pte = pte_offset_kernel(pmd, addr);
-
-	/*
-	 * The reuse_page is found 'first' in table walk before we start
-	 * remapping (which is calling @walk->remap_pte).
-	 */
-	if (!walk->reuse_page) {
-		walk->reuse_page = pte_page(*pte);
-		/*
-		 * Because the reuse address is part of the range that we are
-		 * walking, skip the reuse address range.
-		 */
-		addr += PAGE_SIZE;
-		pte++;
-		walk->nr_walked++;
-	}
-
-	for (; addr != end; addr += PAGE_SIZE, pte++) {
-		walk->remap_pte(pte, addr, walk);
-		walk->nr_walked++;
-	}
-}
-
-static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
-			     unsigned long end,
-			     struct vmemmap_remap_walk *walk)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		int ret;
-
-		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
-		if (ret)
-			return ret;
-
-		next = pmd_addr_end(addr, end);
-		vmemmap_pte_range(pmd, addr, next, walk);
-	} while (pmd++, addr = next, addr != end);
-
-	return 0;
-}
-
-static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
-			     unsigned long end,
-			     struct vmemmap_remap_walk *walk)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_offset(p4d, addr);
-	do {
-		int ret;
-
-		next = pud_addr_end(addr, end);
-		ret = vmemmap_pmd_range(pud, addr, next, walk);
-		if (ret)
-			return ret;
-	} while (pud++, addr = next, addr != end);
-
-	return 0;
-}
-
-static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
-			     unsigned long end,
-			     struct vmemmap_remap_walk *walk)
-{
-	p4d_t *p4d;
-	unsigned long next;
-
-	p4d = p4d_offset(pgd, addr);
-	do {
-		int ret;
-
-		next = p4d_addr_end(addr, end);
-		ret = vmemmap_pud_range(p4d, addr, next, walk);
-		if (ret)
-			return ret;
-	} while (p4d++, addr = next, addr != end);
-
-	return 0;
-}
-
-static int vmemmap_remap_range(unsigned long start, unsigned long end,
-			       struct vmemmap_remap_walk *walk)
-{
-	unsigned long addr = start;
-	unsigned long next;
-	pgd_t *pgd;
-
-	VM_BUG_ON(!PAGE_ALIGNED(start));
-	VM_BUG_ON(!PAGE_ALIGNED(end));
-
-	pgd = pgd_offset_k(addr);
-	do {
-		int ret;
-
-		next = pgd_addr_end(addr, end);
-		ret = vmemmap_p4d_range(pgd, addr, next, walk);
-		if (ret)
-			return ret;
-	} while (pgd++, addr = next, addr != end);
-
-	/*
-	 * We only change the mapping of the vmemmap virtual address range
-	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
-	 * belongs to the range.
-	 */
-	flush_tlb_kernel_range(start + PAGE_SIZE, end);
-
-	return 0;
-}
-
-/*
- * Free a vmemmap page. A vmemmap page can be allocated from the memblock
- * allocator or buddy allocator. If the PG_reserved flag is set, it means
- * that it allocated from the memblock allocator, just free it via the
- * free_bootmem_page(). Otherwise, use __free_page().
- */
-static inline void free_vmemmap_page(struct page *page)
-{
-	if (PageReserved(page))
-		free_bootmem_page(page);
-	else
-		__free_page(page);
-}
-
-/* Free a list of the vmemmap pages */
-static void free_vmemmap_page_list(struct list_head *list)
-{
-	struct page *page, *next;
-
-	list_for_each_entry_safe(page, next, list, lru) {
-		list_del(&page->lru);
-		free_vmemmap_page(page);
-	}
-}
-
-static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
-			      struct vmemmap_remap_walk *walk)
-{
-	/*
-	 * Remap the tail pages as read-only to catch illegal write operation
-	 * to the tail pages.
-	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
-	pte_t entry = mk_pte(walk->reuse_page, pgprot);
-	struct page *page = pte_page(*pte);
-
-	list_add_tail(&page->lru, walk->vmemmap_pages);
-	set_pte_at(&init_mm, addr, pte, entry);
-}
-
-/*
- * How many struct page structs need to be reset. When we reuse the head
- * struct page, the special metadata (e.g. page->flags or page->mapping)
- * cannot copy to the tail struct page structs. The invalid value will be
- * checked in the free_tail_pages_check(). In order to avoid the message
- * of "corrupted mapping in tail page". We need to reset at least 3 (one
- * head struct page struct and two tail struct page structs) struct page
- * structs.
- */
-#define NR_RESET_STRUCT_PAGE		3
-
-static inline void reset_struct_pages(struct page *start)
-{
-	int i;
-	struct page *from = start + NR_RESET_STRUCT_PAGE;
-
-	for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
-		memcpy(start + i, from, sizeof(*from));
-}
-
-static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
-				struct vmemmap_remap_walk *walk)
-{
-	pgprot_t pgprot = PAGE_KERNEL;
-	struct page *page;
-	void *to;
-
-	BUG_ON(pte_page(*pte) != walk->reuse_page);
-
-	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
-	list_del(&page->lru);
-	to = page_to_virt(page);
-	copy_page(to, (void *)walk->reuse_addr);
-	reset_struct_pages(to);
-
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
-}
-
-/**
- * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
- *			to the page which @reuse is mapped to, then free vmemmap
- *			which the range are mapped to.
- * @start:	start address of the vmemmap virtual address range that we want
- *		to remap.
- * @end:	end address of the vmemmap virtual address range that we want to
- *		remap.
- * @reuse:	reuse address.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-int vmemmap_remap_free(unsigned long start, unsigned long end,
-		       unsigned long reuse)
-{
-	int ret;
-	LIST_HEAD(vmemmap_pages);
-	struct vmemmap_remap_walk walk = {
-		.remap_pte	= vmemmap_remap_pte,
-		.reuse_addr	= reuse,
-		.vmemmap_pages	= &vmemmap_pages,
-	};
-
-	/*
-	 * In order to make remapping routine most efficient for the huge pages,
-	 * the routine of vmemmap page table walking has the following rules
-	 * (see more details from the vmemmap_pte_range()):
-	 *
-	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
-	 *   should be continuous.
-	 * - The @reuse address is part of the range [@reuse, @end) that we are
-	 *   walking which is passed to vmemmap_remap_range().
-	 * - The @reuse address is the first in the complete range.
-	 *
-	 * So we need to make sure that @start and @reuse meet the above rules.
-	 */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	mmap_read_lock(&init_mm);
-	ret = vmemmap_remap_range(reuse, end, &walk);
-	if (ret && walk.nr_walked) {
-		end = reuse + walk.nr_walked * PAGE_SIZE;
-		/*
-		 * vmemmap_pages contains pages from the previous
-		 * vmemmap_remap_range call which failed.  These
-		 * are pages which were removed from the vmemmap.
-		 * They will be restored in the following call.
-		 */
-		walk = (struct vmemmap_remap_walk) {
-			.remap_pte	= vmemmap_restore_pte,
-			.reuse_addr	= reuse,
-			.vmemmap_pages	= &vmemmap_pages,
-		};
-
-		vmemmap_remap_range(reuse, end, &walk);
-	}
-	mmap_read_unlock(&init_mm);
-
-	free_vmemmap_page_list(&vmemmap_pages);
-
-	return ret;
-}
-
-static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
-				   gfp_t gfp_mask, struct list_head *list)
-{
-	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
-	int nid = page_to_nid((struct page *)start);
-	struct page *page, *next;
-
-	while (nr_pages--) {
-		page = alloc_pages_node(nid, gfp_mask, 0);
-		if (!page)
-			goto out;
-		list_add_tail(&page->lru, list);
-	}
-
-	return 0;
-out:
-	list_for_each_entry_safe(page, next, list, lru)
-		__free_pages(page, 0);
-	return -ENOMEM;
-}
-
-/**
- * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
- *			 to the page which is from the @vmemmap_pages
- *			 respectively.
- * @start:	start address of the vmemmap virtual address range that we want
- *		to remap.
- * @end:	end address of the vmemmap virtual address range that we want to
- *		remap.
- * @reuse:	reuse address.
- * @gfp_mask:	GFP flag for allocating vmemmap pages.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			unsigned long reuse, gfp_t gfp_mask)
-{
-	LIST_HEAD(vmemmap_pages);
-	struct vmemmap_remap_walk walk = {
-		.remap_pte	= vmemmap_restore_pte,
-		.reuse_addr	= reuse,
-		.vmemmap_pages	= &vmemmap_pages,
-	};
-
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
-		return -ENOMEM;
-
-	mmap_read_lock(&init_mm);
-	vmemmap_remap_range(reuse, end, &walk);
-	mmap_read_unlock(&init_mm);
-
-	return 0;
-}
-#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
 
 /*
  * Allocate a block of memory to be used to back the virtual memory map
-- 
2.11.0



* [PATCH 5/6] mm: hugetlb_vmemmap: replace early_param() with core_param()
  2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
                   ` (3 preceding siblings ...)
  2022-06-13  6:35 ` [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c Muchun Song
@ 2022-06-13  6:35 ` Muchun Song
  2022-06-13 21:43   ` Mike Kravetz
  2022-06-13  6:35 ` [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability Muchun Song
  5 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song

After the following commit and the previous commit are applied,

  78f39084b41d ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl")

there is no ordering requirement between the "hugetlb_free_vmemmap" and
"hugepages" parameters, since the check of whether HVO is enabled has
been removed from hugetlb_vmemmap_init().  Therefore we can safely
replace early_param() with core_param() to simplify the code.
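
With core_param() the parameter is registered with the generic kernel
parameter parser rather than the early-parameter machinery, so the
dedicated handler can go away entirely.  Roughly (a before/after
sketch, see the diff below):

  /* Before: a dedicated early_param() handler. */
  static int __init hugetlb_vmemmap_early_param(char *buf)
  {
  	return kstrtobool(buf, &vmemmap_optimize_enabled);
  }
  early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);

  /* After: "hugetlb_free_vmemmap=on|off" is parsed as a core bool param. */
  core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);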

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index abdf441215bb..9808d32cdb9e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -415,14 +415,8 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
 
-static bool vmemmap_optimize_enabled =
-	IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
-
-static int __init hugetlb_vmemmap_early_param(char *buf)
-{
-	return kstrtobool(buf, &vmemmap_optimize_enabled);
-}
-early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);
+static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
+core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
 
 /*
  * Previously discarded vmemmap pages will be allocated and remapping
-- 
2.11.0



* [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
                   ` (4 preceding siblings ...)
  2022-06-13  6:35 ` [PATCH 5/6] mm: hugetlb_vmemmap: replace early_param() with core_param() Muchun Song
@ 2022-06-13  6:35 ` Muchun Song
  2022-06-13  8:33   ` Oscar Salvador
  5 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-13  6:35 UTC (permalink / raw)
  To: mike.kravetz, david, akpm, corbet
  Cc: linux-mm, linux-kernel, linux-doc, Muchun Song

There is a discussion about the names of hugetlb_vmemmap_alloc/free in
thread [1].  David suggested renaming "alloc/free" to
"optimize/restore" to make the functionality clearer to users:
"optimize" means the function will optimize the vmemmap pages, while
"restore" means restoring the vmemmap pages discarded earlier.  This
commit does that.

Another source of confusion is that RESERVE_VMEMMAP_NR isn't used
explicitly for vmemmap_addr but implicitly for vmemmap_end in
hugetlb_vmemmap_alloc/free.  David suggested that what
hugetlb_vmemmap_init() precomputes can instead be computed at runtime.
We do not need to worry about the runtime overhead since the
calculation is simple enough and those functions are not on a hot path
(a sketch of the computation follows the list below).  This commit
brings the following improvements:

  1) The function name suffixes ("optimize/restore") are more expressive.
  2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
  3) hugetlb_vmemmap_init() does not need to be exported anymore.
  4) The ->optimize_vmemmap_pages field in struct hstate is removed.
  5) is_power_of_2(sizeof(struct page)) is now checked in only one
     place instead of two.
  6) More comments are added for hugetlb_vmemmap_optimize/restore().
  7) For external users, hugetlb_optimize_vmemmap_pages() was originally
     used to detect whether a HugeTLB page's vmemmap pages are
     optimizable.  This commit removes it and introduces a new helper,
     hugetlb_vmemmap_optimizable(), to replace it.  The name is more
     expressive.

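Both hugetlb_vmemmap_restore() and hugetlb_vmemmap_optimize() now
derive the addresses at call time, roughly as follows (a sketch; here
hugetlb_vmemmap_size(h) is assumed to return
pages_per_huge_page(h) * sizeof(struct page), a helper added to
hugetlb_vmemmap.h by this patch):

  unsigned long vmemmap_start = (unsigned long)head;
  unsigned long vmemmap_size  = hugetlb_vmemmap_size(h);
  unsigned long vmemmap_end   = vmemmap_start + vmemmap_size;
  unsigned long vmemmap_reuse = vmemmap_start;

  /* The first vmemmap page is kept and reused for the tail pages. */
  vmemmap_start += RESERVE_VMEMMAP_SIZE;
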
Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/ [1]
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |   7 +--
 mm/hugetlb.c            |  11 ++--
 mm/hugetlb_vmemmap.c    | 154 +++++++++++++++++++++++-------------------------
 mm/hugetlb_vmemmap.h    |  39 +++++++-----
 4 files changed, 105 insertions(+), 106 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 642a39016f9a..0b475faf9bf4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -640,9 +640,6 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-	unsigned int optimize_vmemmap_pages;
-#endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
 	struct cftype cgroup_files_dfl[8];
@@ -718,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
 	return hstate_file(vma->vm_file);
 }
 
-static inline unsigned long huge_page_size(struct hstate *h)
+static inline unsigned long huge_page_size(const struct hstate *h)
 {
 	return (unsigned long)PAGE_SIZE << h->order;
 }
@@ -747,7 +744,7 @@ static inline bool hstate_is_gigantic(struct hstate *h)
 	return huge_page_order(h) >= MAX_ORDER;
 }
 
-static inline unsigned int pages_per_huge_page(struct hstate *h)
+static inline unsigned int pages_per_huge_page(const struct hstate *h)
 {
 	return 1 << h->order;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 259b9c41892f..26a5af7f0065 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1541,7 +1541,7 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
-	if (hugetlb_vmemmap_alloc(h, page)) {
+	if (hugetlb_vmemmap_restore(h, page)) {
 		spin_lock_irq(&hugetlb_lock);
 		/*
 		 * If we cannot allocate vmemmap pages, just refuse to free the
@@ -1627,7 +1627,7 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
 
 static inline void flush_free_hpage_work(struct hstate *h)
 {
-	if (hugetlb_optimize_vmemmap_pages(h))
+	if (hugetlb_vmemmap_optimizable(h))
 		flush_work(&free_hpage_work);
 }
 
@@ -1749,7 +1749,7 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
 
 static void __prep_new_huge_page(struct hstate *h, struct page *page)
 {
-	hugetlb_vmemmap_free(h, page);
+	hugetlb_vmemmap_optimize(h, page);
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	hugetlb_set_page_subpool(page, NULL);
@@ -2122,7 +2122,7 @@ int dissolve_free_huge_page(struct page *page)
 		 * Attempt to allocate vmemmmap here so that we can take
 		 * appropriate action on failure.
 		 */
-		rc = hugetlb_vmemmap_alloc(h, head);
+		rc = hugetlb_vmemmap_restore(h, head);
 		if (!rc) {
 			/*
 			 * Move PageHWPoison flag from head page to the raw
@@ -3434,7 +3434,7 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
 	remove_hugetlb_page_for_demote(h, page, false);
 	spin_unlock_irq(&hugetlb_lock);
 
-	rc = hugetlb_vmemmap_alloc(h, page);
+	rc = hugetlb_vmemmap_restore(h, page);
 	if (rc) {
 		/* Allocation of vmemmmap failed, we can not demote page */
 		spin_lock_irq(&hugetlb_lock);
@@ -4124,7 +4124,6 @@ void __init hugetlb_add_hstate(unsigned int order)
 	h->next_nid_to_free = first_memory_node;
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
-	hugetlb_vmemmap_init(h);
 
 	parsed_hstate = h;
 }
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 9808d32cdb9e..595b0cee3109 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -35,16 +35,6 @@ struct vmemmap_remap_walk {
 	struct list_head	*vmemmap_pages;
 };
 
-/*
- * There are a lot of struct page structures associated with each HugeTLB page.
- * For tail pages, the value of compound_head is the same. So we can reuse first
- * page of head page structures. We map the virtual addresses of all the pages
- * of tail page structures to the head page struct, and then free these page
- * frames. Therefore, we need to reserve one pages as vmemmap areas.
- */
-#define RESERVE_VMEMMAP_NR		1U
-#define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
-
 static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
 {
 	pmd_t __pmd;
@@ -418,32 +408,38 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
 static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
 core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
 
-/*
- * Previously discarded vmemmap pages will be allocated and remapping
- * after this function returns zero.
+/**
+ * hugetlb_vmemmap_restore - restore previously optimized (by
+ *			     hugetlb_vmemmap_optimize()) vmemmap pages which
+ *			     will be reallocated and remapped.
+ * @h:		struct hstate.
+ * @head:	the head page whose vmemmap pages will be restored.
+ *
+ * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
+ * negative error code otherwise.
  */
-int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head)
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 {
 	int ret;
-	unsigned long vmemmap_addr = (unsigned long)head;
-	unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages;
+	unsigned long vmemmap_start = (unsigned long)head;
+	unsigned long vmemmap_end, vmemmap_reuse, vmemmap_size;
 
 	if (!HPageVmemmapOptimized(head))
 		return 0;
 
-	vmemmap_addr	+= RESERVE_VMEMMAP_SIZE;
-	vmemmap_pages	= hugetlb_optimize_vmemmap_pages(h);
-	vmemmap_end	= vmemmap_addr + (vmemmap_pages << PAGE_SHIFT);
-	vmemmap_reuse	= vmemmap_addr - PAGE_SIZE;
+	vmemmap_size	= hugetlb_vmemmap_size(h);
+	vmemmap_end	= vmemmap_start + vmemmap_size;
+	vmemmap_reuse	= vmemmap_start;
+	vmemmap_start	+= RESERVE_VMEMMAP_SIZE;
 
 	/*
-	 * The pages which the vmemmap virtual address range [@vmemmap_addr,
+	 * The pages which the vmemmap virtual address range [@vmemmap_start,
 	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
 	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse,
 				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
 	if (!ret) {
 		ClearHPageVmemmapOptimized(head);
@@ -453,84 +449,62 @@ int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head)
 	return ret;
 }
 
-static unsigned int optimizable_vmemmap_pages(struct hstate *h,
-					      struct page *head)
+/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
+static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
 {
 	unsigned long pfn = page_to_pfn(head);
 	unsigned long end = pfn + pages_per_huge_page(h);
 
 	if (!READ_ONCE(vmemmap_optimize_enabled))
-		return 0;
+		return false;
+
+	if (!hugetlb_vmemmap_optimizable(h))
+		return false;
 
 	for (; pfn < end; pfn += PAGES_PER_SECTION) {
 		if (section_cannot_optimize_vmemmap(__pfn_to_section(pfn)))
-			return 0;
+			return false;
 	}
 
-	return hugetlb_optimize_vmemmap_pages(h);
+	return true;
 }
 
-void hugetlb_vmemmap_free(struct hstate *h, struct page *head)
+/**
+ * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
+ * @h:		struct hstate.
+ * @head:	the head page whose vmemmap pages will be optimized.
+ *
+ * This function only tries to optimize @head's vmemmap pages and does not
+ * guarantee that the optimization will succeed after it returns. The caller
+ * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
+ * have been optimized.
+ */
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 {
-	unsigned long vmemmap_addr = (unsigned long)head;
-	unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages;
+	unsigned long vmemmap_start = (unsigned long)head;
+	unsigned long vmemmap_end, vmemmap_reuse, vmemmap_size;
 
-	vmemmap_pages = optimizable_vmemmap_pages(h, head);
-	if (!vmemmap_pages)
+	if (!vmemmap_should_optimize(h, head))
 		return;
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
 
-	vmemmap_addr	+= RESERVE_VMEMMAP_SIZE;
-	vmemmap_end	= vmemmap_addr + (vmemmap_pages << PAGE_SHIFT);
-	vmemmap_reuse	= vmemmap_addr - PAGE_SIZE;
+	vmemmap_size	= hugetlb_vmemmap_size(h);
+	vmemmap_end	= vmemmap_start + vmemmap_size;
+	vmemmap_reuse	= vmemmap_start;
+	vmemmap_start	+= RESERVE_VMEMMAP_SIZE;
 
 	/*
-	 * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end)
+	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
 	 * to the page which @vmemmap_reuse is mapped to, then free the pages
-	 * which the range [@vmemmap_addr, @vmemmap_end] is mapped to.
+	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
 	 */
-	if (vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse))
+	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
 	else
 		SetHPageVmemmapOptimized(head);
 }
 
-void __init hugetlb_vmemmap_init(struct hstate *h)
-{
-	unsigned int nr_pages = pages_per_huge_page(h);
-	unsigned int vmemmap_pages;
-
-	/*
-	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
-	 * page structs that can be used when HVO is enabled, add a BUILD_BUG_ON
-	 * to catch invalid usage of the tail page structs.
-	 */
-	BUILD_BUG_ON(__NR_USED_SUBPAGE >=
-		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
-
-	if (!is_power_of_2(sizeof(struct page))) {
-		pr_warn_once("cannot optimize vmemmap pages because \"struct page\" crosses page boundaries\n");
-		return;
-	}
-
-	vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
-	/*
-	 * The head page is not to be freed to buddy allocator, the other tail
-	 * pages will map to the head page, so they can be freed.
-	 *
-	 * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true
-	 * on some architectures (e.g. aarch64). See Documentation/arm64/
-	 * hugetlbpage.rst for more details.
-	 */
-	if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR))
-		h->optimize_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
-
-	pr_info("can optimize %d vmemmap pages for %s\n",
-		h->optimize_vmemmap_pages, h->name);
-}
-
-#ifdef CONFIG_PROC_SYSCTL
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
 		.procname	= "hugetlb_optimize_vmemmap",
@@ -542,16 +516,36 @@ static struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{ }
 };
 
-static __init int hugetlb_vmemmap_sysctls_init(void)
+static int __init hugetlb_vmemmap_init(void)
 {
+	const struct hstate *h;
+	bool optimizable = false;
+
 	/*
-	 * If "struct page" crosses page boundaries, the vmemmap pages cannot
-	 * be optimized.
+	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
+	 * page structs that can be used when HVO is enabled.
 	 */
-	if (is_power_of_2(sizeof(struct page)))
-		register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
+	BUILD_BUG_ON(__NR_USED_SUBPAGE >= RESERVE_VMEMMAP_SIZE / sizeof(struct page));
+
+	for_each_hstate(h) {
+		char buf[16];
+		unsigned int size = 0;
+
+		if (hugetlb_vmemmap_optimizable(h))
+			size = hugetlb_vmemmap_size(h) - RESERVE_VMEMMAP_SIZE;
+		optimizable = size ? true : optimizable;
+		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
+				sizeof(buf));
+		pr_info("%d KiB vmemmap can be optimized for a %s page\n",
+			size / SZ_1K, buf);
+	}
 
+	if (optimizable) {
+		if (IS_ENABLED(CONFIG_PROC_SYSCTL))
+			register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
+		pr_info("%d huge pages whose vmemmap are optimized at boot\n",
+			static_key_count(&hugetlb_optimize_vmemmap_key.key));
+	}
 	return 0;
 }
-late_initcall(hugetlb_vmemmap_sysctls_init);
-#endif /* CONFIG_PROC_SYSCTL */
+late_initcall(hugetlb_vmemmap_init);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index ba66fadad9fc..0af3f08cf63c 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -11,35 +11,44 @@
 #include <linux/hugetlb.h>
 
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head);
-void hugetlb_vmemmap_free(struct hstate *h, struct page *head);
-void hugetlb_vmemmap_init(struct hstate *h);
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
 
 /*
- * How many vmemmap pages associated with a HugeTLB page that can be
- * optimized and freed to the buddy allocator.
+ * There are a lot of struct page structures associated with each HugeTLB page.
+ * The only 'useful' information in the tail page structs is the compound_head
+ * field which is the same for all tail page structs. So we can reuse the first
+ * page frame of page structs. The virtual addresses of all the remaining pages
+ * of tail page structs will be mapped to the head page frame, and then these
+ * tail page frames are freed. Therefore, we need to reserve one page as
+ * vmemmap. See Documentation/vm/vmemmap_dedup.rst.
  */
-static inline unsigned int hugetlb_optimize_vmemmap_pages(struct hstate *h)
+#define RESERVE_VMEMMAP_SIZE		PAGE_SIZE
+
+static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
 {
-	return h->optimize_vmemmap_pages;
+	return pages_per_huge_page(h) * sizeof(struct page);
 }
-#else
-static inline int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head)
+
+static inline bool hugetlb_vmemmap_optimizable(const struct hstate *h)
 {
-	return 0;
+	if (!is_power_of_2(sizeof(struct page)))
+		return false;
+	return hugetlb_vmemmap_size(h) > RESERVE_VMEMMAP_SIZE;
 }
-
-static inline void hugetlb_vmemmap_free(struct hstate *h, struct page *head)
+#else
+static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 {
+	return 0;
 }
 
-static inline void hugetlb_vmemmap_init(struct hstate *h)
+static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 {
 }
 
-static inline unsigned int hugetlb_optimize_vmemmap_pages(struct hstate *h)
+static inline bool hugetlb_vmemmap_optimizable(const struct hstate *h)
 {
-	return 0;
+	return false;
 }
 #endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()
  2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
@ 2022-06-13  8:04   ` Oscar Salvador
  2022-06-13 18:15   ` Mike kravetz
  2022-06-20 16:57   ` Catalin Marinas
  2 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2022-06-13  8:04 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel,
	linux-doc, Catalin Marinas, Will Deacon, Anshuman Khandual

On Mon, Jun 13, 2022 at 02:35:07PM +0800, Muchun Song wrote:
> The name hugetlb_optimize_vmemmap_enabled() a bit confusing as it tests
> two conditions (enabled and pages in use).  Instead of coming up to
> an appropriate name, we could just delete it.  There is already a
> discussion about deleting it in thread [1].
> 
> There is only one user of hugetlb_optimize_vmemmap_enabled() outside of
> hugetlb_vmemmap, that is flush_dcache_page() in arch/arm64/mm/flush.c.
> However, it does not need to call hugetlb_optimize_vmemmap_enabled()
> in flush_dcache_page() since HugeTLB pages are always fully mapped and
> only head page will be set PG_dcache_clean meaning only head page's flag
> may need to be cleared (see commit cf5a501d985b).  So it is easy to
> remove hugetlb_optimize_vmemmap_enabled().
> 
> Link: https://lore.kernel.org/all/c77c61c8-8a5a-87e8-db89-d04d8aaab4cc@oracle.com/ [1]
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling
  2022-06-13  6:35 ` [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling Muchun Song
@ 2022-06-13  8:10   ` Oscar Salvador
  2022-06-13  8:24     ` Muchun Song
  2022-06-13 18:28   ` Mike kravetz
  1 sibling, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2022-06-13  8:10 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:08PM +0800, Muchun Song wrote:
> We hold another reference to hugetlb_optimize_vmemmap_key when
> turning vmemmap_optimize_mode on, because we use a static_key to tell
> memory_hotplug that memory_hotplug.memmap_on_memory should be
> overridden.  However, this rule went away when we introduced
> SECTION_CANNOT_OPTIMIZE_VMEMMAP.  Therefore, we could simplify
> vmemmap_optimize_mode handling by not holding another reference
> to hugetlb_optimize_vmemmap_key.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

LGTM, and it looks way nicer, so

Reviewed-by: Oscar Salvador <osalvador@suse.de>

One question below though

> -static enum vmemmap_optimize_mode vmemmap_optimize_mode =
> +static bool vmemmap_optimize_enabled =
>  	IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);

So, by default vmemmap_optimize_enabled will be on if we have
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON, but we can always override that
via cmdline, as below, right?


>  
> -static void vmemmap_optimize_mode_switch(enum vmemmap_optimize_mode to)
> -{
> -	if (vmemmap_optimize_mode == to)
> -		return;
> -
> -	if (to == VMEMMAP_OPTIMIZE_OFF)
> -		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> -	else
> -		static_branch_inc(&hugetlb_optimize_vmemmap_key);
> -	WRITE_ONCE(vmemmap_optimize_mode, to);
> -}
> -
>  static int __init hugetlb_vmemmap_early_param(char *buf)
>  {
> -	bool enable;
> -	enum vmemmap_optimize_mode mode;
> -
> -	if (kstrtobool(buf, &enable))
> -		return -EINVAL;
> -
> -	mode = enable ? VMEMMAP_OPTIMIZE_ON : VMEMMAP_OPTIMIZE_OFF;
> -	vmemmap_optimize_mode_switch(mode);
> -
> -	return 0;
> +	return kstrtobool(buf, &vmemmap_optimize_enabled);
>  }
>  early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);
>  
> @@ -103,7 +76,7 @@ static unsigned int optimizable_vmemmap_pages(struct hstate *h,
>  	unsigned long pfn = page_to_pfn(head);
>  	unsigned long end = pfn + pages_per_huge_page(h);
>  
> -	if (READ_ONCE(vmemmap_optimize_mode) == VMEMMAP_OPTIMIZE_OFF)
> +	if (!READ_ONCE(vmemmap_optimize_enabled))
>  		return 0;
>  
>  	for (; pfn < end; pfn += PAGES_PER_SECTION) {
> @@ -155,7 +128,6 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>  
>  	if (!is_power_of_2(sizeof(struct page))) {
>  		pr_warn_once("cannot optimize vmemmap pages because \"struct page\" crosses page boundaries\n");
> -		static_branch_disable(&hugetlb_optimize_vmemmap_key);
>  		return;
>  	}
>  
> @@ -176,36 +148,13 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>  }
>  
>  #ifdef CONFIG_PROC_SYSCTL
> -static int hugetlb_optimize_vmemmap_handler(struct ctl_table *table, int write,
> -					    void *buffer, size_t *length,
> -					    loff_t *ppos)
> -{
> -	int ret;
> -	enum vmemmap_optimize_mode mode;
> -	static DEFINE_MUTEX(sysctl_mutex);
> -
> -	if (write && !capable(CAP_SYS_ADMIN))
> -		return -EPERM;
> -
> -	mutex_lock(&sysctl_mutex);
> -	mode = vmemmap_optimize_mode;
> -	table->data = &mode;
> -	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> -	if (write && !ret)
> -		vmemmap_optimize_mode_switch(mode);
> -	mutex_unlock(&sysctl_mutex);
> -
> -	return ret;
> -}
> -
>  static struct ctl_table hugetlb_vmemmap_sysctls[] = {
>  	{
>  		.procname	= "hugetlb_optimize_vmemmap",
> -		.maxlen		= sizeof(enum vmemmap_optimize_mode),
> +		.data		= &vmemmap_optimize_enabled,
> +		.maxlen		= sizeof(int),
>  		.mode		= 0644,
> -		.proc_handler	= hugetlb_optimize_vmemmap_handler,
> -		.extra1		= SYSCTL_ZERO,
> -		.extra2		= SYSCTL_ONE,
> +		.proc_handler	= proc_dobool,
>  	},
>  	{ }
>  };
> -- 
> 2.11.0
> 
> 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
@ 2022-06-13  8:13   ` Oscar Salvador
  2022-06-13 15:39   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2022-06-13  8:13 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:09PM +0800, Muchun Song wrote:
> It is inconvenient to mention the feature of optimizing vmemmap pages associated
> with HugeTLB pages when communicating with others since there is no specific or
> abbreviated name for it when it is first introduced.  Let us give it the name HVO
> (HugeTLB Vmemmap Optimization) from now on.
> 
> This commit also updates the document about "hugetlb_free_vmemmap" by the way
> discussed in thread [1].
> 
> Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

For the Documentation/admin-guide/kernel-parameters.txt part, I think it gets
much clearer.
About the name, I do not have a strong opinion.

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  Documentation/admin-guide/kernel-parameters.txt |  7 ++++---
>  Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +--
>  Documentation/admin-guide/sysctl/vm.rst         |  3 +--
>  fs/Kconfig                                      | 13 ++++++-------
>  mm/hugetlb_vmemmap.c                            |  8 ++++----
>  mm/hugetlb_vmemmap.h                            |  4 ++--
>  6 files changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 391b43fee93e..7539553b3fb0 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1725,12 +1725,13 @@
>  	hugetlb_free_vmemmap=
>  			[KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  			enabled.
> +			Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
>  			Allows heavy hugetlb users to free up some more
>  			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
> -			Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
> +			Format: { on | off (default) }
>  
> -			[oO][Nn]/Y/y/1: enable the feature
> -			[oO][Ff]/N/n/0: disable the feature
> +			on: enable HVO
> +			off: disable HVO
>  
>  			Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
>  			the default is on.
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index a90330d0a837..64e0d5c512e7 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -164,8 +164,7 @@ default_hugepagesz
>  	will all result in 256 2M huge pages being allocated.  Valid default
>  	huge page size is architecture dependent.
>  hugetlb_free_vmemmap
> -	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
> -	unused vmemmap pages associated with each HugeTLB page.
> +	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HVO.
>  
>  When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
>  indicates the current number of pre-allocated huge pages of the default size.
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index d7374a1e8ac9..c9f35db973f0 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -569,8 +569,7 @@ This knob is not available when the size of 'struct page' (a structure defined
>  in include/linux/mm_types.h) is not power of two (an unusual system config could
>  result in this).
>  
> -Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
> -associated with each HugeTLB page.
> +Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
>  
>  Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
>  buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 5976eb33535f..2f9fd840cb66 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -247,8 +247,7 @@ config HUGETLB_PAGE
>  
>  #
>  # Select this config option from the architecture Kconfig, if it is preferred
> -# to enable the feature of minimizing overhead of struct page associated with
> -# each HugeTLB page.
> +# to enable the feature of HugeTLB Vmemmap Optimization (HVO).
>  #
>  config ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  	bool
> @@ -259,14 +258,14 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  	depends on SPARSEMEM_VMEMMAP
>  
>  config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
> -	bool "Default optimizing vmemmap pages of HugeTLB to on"
> +	bool "Default HugeTLB Vmemmap Optimization (HVO) to on"
>  	default n
>  	depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  	help
> -	  When using HUGETLB_PAGE_OPTIMIZE_VMEMMAP, the optimizing unused vmemmap
> -	  pages associated with each HugeTLB page is default off. Say Y here
> -	  to enable optimizing vmemmap pages of HugeTLB by default. It can then
> -	  be disabled on the command line via hugetlb_free_vmemmap=off.
> +	  When using HUGETLB_PAGE_OPTIMIZE_VMEMMAP, the HugeTLB Vmemmap
> +	  Optimization (HVO) is off by default. Say Y here to enable HVO
> +	  by default. It can then be disabled on the command line via
> +	  hugetlb_free_vmemmap=off or sysctl.
>  
>  config MEMFD_CREATE
>  	def_bool TMPFS || HUGETLBFS
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 132dc83f0130..c10540993577 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -1,8 +1,8 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
> - * Optimize vmemmap pages associated with HugeTLB
> + * HugeTLB Vmemmap Optimization (HVO)
>   *
> - * Copyright (c) 2020, Bytedance. All rights reserved.
> + * Copyright (c) 2020, ByteDance. All rights reserved.
>   *
>   *     Author: Muchun Song <songmuchun@bytedance.com>
>   *
> @@ -120,8 +120,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>  
>  	/*
>  	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
> -	 * page structs that can be used when CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP,
> -	 * so add a BUILD_BUG_ON to catch invalid usage of the tail struct page.
> +	 * page structs that can be used when HVO is enabled, add a BUILD_BUG_ON
> +	 * to catch invalid usage of the tail page structs.
>  	 */
>  	BUILD_BUG_ON(__NR_USED_SUBPAGE >=
>  		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index 109b0a53b6fe..ba66fadad9fc 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -1,8 +1,8 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
> - * Optimize vmemmap pages associated with HugeTLB
> + * HugeTLB Vmemmap Optimization (HVO)
>   *
> - * Copyright (c) 2020, Bytedance. All rights reserved.
> + * Copyright (c) 2020, ByteDance. All rights reserved.
>   *
>   *     Author: Muchun Song <songmuchun@bytedance.com>
>   */
> -- 
> 2.11.0
> 
> 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  2022-06-13  6:35 ` [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c Muchun Song
@ 2022-06-13  8:15   ` Oscar Salvador
  2022-06-13 21:34   ` Mike Kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2022-06-13  8:15 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:10PM +0800, Muchun Song wrote:
> When I first introduced vmemmap manipulation functions related to HugeTLB,
> I thought those functions may be reused by other modules (e.g. using a
> similar approach to optimize vmemmap pages; unfortunately, DAX used the
> same approach but does not use those functions).  After two years, we didn't
> see any other users.  So move those functions to hugetlb_vmemmap.c.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  include/linux/mm.h   |   7 -
>  mm/hugetlb_vmemmap.c | 391 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  mm/sparse-vmemmap.c  | 391 ---------------------------------------------------
>  3 files changed, 390 insertions(+), 399 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 623c2ee8330a..152d0eefe5aa 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3208,13 +3208,6 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
>  }
>  #endif
>  
> -#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> -int vmemmap_remap_free(unsigned long start, unsigned long end,
> -		       unsigned long reuse);
> -int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> -			unsigned long reuse, gfp_t gfp_mask);
> -#endif
> -
>  void *sparse_buffer_alloc(unsigned long size);
>  struct page * __populate_section_memmap(unsigned long pfn,
>  		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c10540993577..abdf441215bb 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -10,9 +10,31 @@
>   */
>  #define pr_fmt(fmt)	"HugeTLB: " fmt
>  
> -#include <linux/memory_hotplug.h>
> +#include <linux/pgtable.h>
> +#include <linux/bootmem_info.h>
> +#include <asm/pgalloc.h>
> +#include <asm/tlbflush.h>
>  #include "hugetlb_vmemmap.h"
>  
> +/**
> + * struct vmemmap_remap_walk - walk vmemmap page table
> + *
> + * @remap_pte:		called for each lowest-level entry (PTE).
> + * @nr_walked:		the number of walked pte.
> + * @reuse_page:		the page which is reused for the tail vmemmap pages.
> + * @reuse_addr:		the virtual address of the @reuse_page page.
> + * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> + *			or is mapped from.
> + */
> +struct vmemmap_remap_walk {
> +	void			(*remap_pte)(pte_t *pte, unsigned long addr,
> +					     struct vmemmap_remap_walk *walk);
> +	unsigned long		nr_walked;
> +	struct page		*reuse_page;
> +	unsigned long		reuse_addr;
> +	struct list_head	*vmemmap_pages;
> +};
> +
>  /*
>   * There are a lot of struct page structures associated with each HugeTLB page.
>   * For tail pages, the value of compound_head is the same. So we can reuse first
> @@ -23,6 +45,373 @@
>  #define RESERVE_VMEMMAP_NR		1U
>  #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
>  
> +static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +{
> +	pmd_t __pmd;
> +	int i;
> +	unsigned long addr = start;
> +	struct page *page = pmd_page(*pmd);
> +	pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
> +
> +	if (!pgtable)
> +		return -ENOMEM;
> +
> +	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
> +
> +	for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
> +		pte_t entry, *pte;
> +		pgprot_t pgprot = PAGE_KERNEL;
> +
> +		entry = mk_pte(page + i, pgprot);
> +		pte = pte_offset_kernel(&__pmd, addr);
> +		set_pte_at(&init_mm, addr, pte, entry);
> +	}
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	if (likely(pmd_leaf(*pmd))) {
> +		/* Make pte visible before pmd. See comment in pmd_install(). */
> +		smp_wmb();
> +		pmd_populate_kernel(&init_mm, pmd, pgtable);
> +		flush_tlb_kernel_range(start, start + PMD_SIZE);
> +	} else {
> +		pte_free_kernel(&init_mm, pgtable);
> +	}
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return 0;
> +}
> +
> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +{
> +	int leaf;
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	leaf = pmd_leaf(*pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	if (!leaf)
> +		return 0;
> +
> +	return __split_vmemmap_huge_pmd(pmd, start);
> +}
> +
> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> +			      unsigned long end,
> +			      struct vmemmap_remap_walk *walk)
> +{
> +	pte_t *pte = pte_offset_kernel(pmd, addr);
> +
> +	/*
> +	 * The reuse_page is found 'first' in table walk before we start
> +	 * remapping (which is calling @walk->remap_pte).
> +	 */
> +	if (!walk->reuse_page) {
> +		walk->reuse_page = pte_page(*pte);
> +		/*
> +		 * Because the reuse address is part of the range that we are
> +		 * walking, skip the reuse address range.
> +		 */
> +		addr += PAGE_SIZE;
> +		pte++;
> +		walk->nr_walked++;
> +	}
> +
> +	for (; addr != end; addr += PAGE_SIZE, pte++) {
> +		walk->remap_pte(pte, addr, walk);
> +		walk->nr_walked++;
> +	}
> +}
> +
> +static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> +			     unsigned long end,
> +			     struct vmemmap_remap_walk *walk)
> +{
> +	pmd_t *pmd;
> +	unsigned long next;
> +
> +	pmd = pmd_offset(pud, addr);
> +	do {
> +		int ret;
> +
> +		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> +		if (ret)
> +			return ret;
> +
> +		next = pmd_addr_end(addr, end);
> +		vmemmap_pte_range(pmd, addr, next, walk);
> +	} while (pmd++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
> +			     unsigned long end,
> +			     struct vmemmap_remap_walk *walk)
> +{
> +	pud_t *pud;
> +	unsigned long next;
> +
> +	pud = pud_offset(p4d, addr);
> +	do {
> +		int ret;
> +
> +		next = pud_addr_end(addr, end);
> +		ret = vmemmap_pmd_range(pud, addr, next, walk);
> +		if (ret)
> +			return ret;
> +	} while (pud++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
> +			     unsigned long end,
> +			     struct vmemmap_remap_walk *walk)
> +{
> +	p4d_t *p4d;
> +	unsigned long next;
> +
> +	p4d = p4d_offset(pgd, addr);
> +	do {
> +		int ret;
> +
> +		next = p4d_addr_end(addr, end);
> +		ret = vmemmap_pud_range(p4d, addr, next, walk);
> +		if (ret)
> +			return ret;
> +	} while (p4d++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static int vmemmap_remap_range(unsigned long start, unsigned long end,
> +			       struct vmemmap_remap_walk *walk)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	pgd_t *pgd;
> +
> +	VM_BUG_ON(!PAGE_ALIGNED(start));
> +	VM_BUG_ON(!PAGE_ALIGNED(end));
> +
> +	pgd = pgd_offset_k(addr);
> +	do {
> +		int ret;
> +
> +		next = pgd_addr_end(addr, end);
> +		ret = vmemmap_p4d_range(pgd, addr, next, walk);
> +		if (ret)
> +			return ret;
> +	} while (pgd++, addr = next, addr != end);
> +
> +	/*
> +	 * We only change the mapping of the vmemmap virtual address range
> +	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> +	 * belongs to the range.
> +	 */
> +	flush_tlb_kernel_range(start + PAGE_SIZE, end);
> +
> +	return 0;
> +}
> +
> +/*
> + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> + * that it allocated from the memblock allocator, just free it via the
> + * free_bootmem_page(). Otherwise, use __free_page().
> + */
> +static inline void free_vmemmap_page(struct page *page)
> +{
> +	if (PageReserved(page))
> +		free_bootmem_page(page);
> +	else
> +		__free_page(page);
> +}
> +
> +/* Free a list of the vmemmap pages */
> +static void free_vmemmap_page_list(struct list_head *list)
> +{
> +	struct page *page, *next;
> +
> +	list_for_each_entry_safe(page, next, list, lru) {
> +		list_del(&page->lru);
> +		free_vmemmap_page(page);
> +	}
> +}
> +
> +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> +			      struct vmemmap_remap_walk *walk)
> +{
> +	/*
> +	 * Remap the tail pages as read-only to catch illegal write operation
> +	 * to the tail pages.
> +	 */
> +	pgprot_t pgprot = PAGE_KERNEL_RO;
> +	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +	struct page *page = pte_page(*pte);
> +
> +	list_add_tail(&page->lru, walk->vmemmap_pages);
> +	set_pte_at(&init_mm, addr, pte, entry);
> +}
> +
> +/*
> + * How many struct page structs need to be reset. When we reuse the head
> + * struct page, the special metadata (e.g. page->flags or page->mapping)
> + * cannot copy to the tail struct page structs. The invalid value will be
> + * checked in the free_tail_pages_check(). In order to avoid the message
> + * of "corrupted mapping in tail page". We need to reset at least 3 (one
> + * head struct page struct and two tail struct page structs) struct page
> + * structs.
> + */
> +#define NR_RESET_STRUCT_PAGE		3
> +
> +static inline void reset_struct_pages(struct page *start)
> +{
> +	int i;
> +	struct page *from = start + NR_RESET_STRUCT_PAGE;
> +
> +	for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
> +		memcpy(start + i, from, sizeof(*from));
> +}
> +
> +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> +				struct vmemmap_remap_walk *walk)
> +{
> +	pgprot_t pgprot = PAGE_KERNEL;
> +	struct page *page;
> +	void *to;
> +
> +	BUG_ON(pte_page(*pte) != walk->reuse_page);
> +
> +	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> +	list_del(&page->lru);
> +	to = page_to_virt(page);
> +	copy_page(to, (void *)walk->reuse_addr);
> +	reset_struct_pages(to);
> +
> +	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> +}
> +
> +/**
> + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> + *			to the page which @reuse is mapped to, then free vmemmap
> + *			which the range are mapped to.
> + * @start:	start address of the vmemmap virtual address range that we want
> + *		to remap.
> + * @end:	end address of the vmemmap virtual address range that we want to
> + *		remap.
> + * @reuse:	reuse address.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_free(unsigned long start, unsigned long end,
> +			      unsigned long reuse)
> +{
> +	int ret;
> +	LIST_HEAD(vmemmap_pages);
> +	struct vmemmap_remap_walk walk = {
> +		.remap_pte	= vmemmap_remap_pte,
> +		.reuse_addr	= reuse,
> +		.vmemmap_pages	= &vmemmap_pages,
> +	};
> +
> +	/*
> +	 * In order to make remapping routine most efficient for the huge pages,
> +	 * the routine of vmemmap page table walking has the following rules
> +	 * (see more details from the vmemmap_pte_range()):
> +	 *
> +	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> +	 *   should be continuous.
> +	 * - The @reuse address is part of the range [@reuse, @end) that we are
> +	 *   walking which is passed to vmemmap_remap_range().
> +	 * - The @reuse address is the first in the complete range.
> +	 *
> +	 * So we need to make sure that @start and @reuse meet the above rules.
> +	 */
> +	BUG_ON(start - reuse != PAGE_SIZE);
> +
> +	mmap_read_lock(&init_mm);
> +	ret = vmemmap_remap_range(reuse, end, &walk);
> +	if (ret && walk.nr_walked) {
> +		end = reuse + walk.nr_walked * PAGE_SIZE;
> +		/*
> +		 * vmemmap_pages contains pages from the previous
> +		 * vmemmap_remap_range call which failed.  These
> +		 * are pages which were removed from the vmemmap.
> +		 * They will be restored in the following call.
> +		 */
> +		walk = (struct vmemmap_remap_walk) {
> +			.remap_pte	= vmemmap_restore_pte,
> +			.reuse_addr	= reuse,
> +			.vmemmap_pages	= &vmemmap_pages,
> +		};
> +
> +		vmemmap_remap_range(reuse, end, &walk);
> +	}
> +	mmap_read_unlock(&init_mm);
> +
> +	free_vmemmap_page_list(&vmemmap_pages);
> +
> +	return ret;
> +}
> +
> +static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> +				   gfp_t gfp_mask, struct list_head *list)
> +{
> +	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> +	int nid = page_to_nid((struct page *)start);
> +	struct page *page, *next;
> +
> +	while (nr_pages--) {
> +		page = alloc_pages_node(nid, gfp_mask, 0);
> +		if (!page)
> +			goto out;
> +		list_add_tail(&page->lru, list);
> +	}
> +
> +	return 0;
> +out:
> +	list_for_each_entry_safe(page, next, list, lru)
> +		__free_pages(page, 0);
> +	return -ENOMEM;
> +}
> +
> +/**
> + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> + *			 to the page which is from the @vmemmap_pages
> + *			 respectively.
> + * @start:	start address of the vmemmap virtual address range that we want
> + *		to remap.
> + * @end:	end address of the vmemmap virtual address range that we want to
> + *		remap.
> + * @reuse:	reuse address.
> + * @gfp_mask:	GFP flag for allocating vmemmap pages.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> +			       unsigned long reuse, gfp_t gfp_mask)
> +{
> +	LIST_HEAD(vmemmap_pages);
> +	struct vmemmap_remap_walk walk = {
> +		.remap_pte	= vmemmap_restore_pte,
> +		.reuse_addr	= reuse,
> +		.vmemmap_pages	= &vmemmap_pages,
> +	};
> +
> +	/* See the comment in the vmemmap_remap_free(). */
> +	BUG_ON(start - reuse != PAGE_SIZE);
> +
> +	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
> +		return -ENOMEM;
> +
> +	mmap_read_lock(&init_mm);
> +	vmemmap_remap_range(reuse, end, &walk);
> +	mmap_read_unlock(&init_mm);
> +
> +	return 0;
> +}
> +
>  DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
>  EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
>  
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 49cb15cbe590..473effcb2285 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -27,400 +27,9 @@
>  #include <linux/spinlock.h>
>  #include <linux/vmalloc.h>
>  #include <linux/sched.h>
> -#include <linux/pgtable.h>
> -#include <linux/bootmem_info.h>
>  
>  #include <asm/dma.h>
>  #include <asm/pgalloc.h>
> -#include <asm/tlbflush.h>
> -
> -#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> -/**
> - * struct vmemmap_remap_walk - walk vmemmap page table
> - *
> - * @remap_pte:		called for each lowest-level entry (PTE).
> - * @nr_walked:		the number of walked pte.
> - * @reuse_page:		the page which is reused for the tail vmemmap pages.
> - * @reuse_addr:		the virtual address of the @reuse_page page.
> - * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> - *			or is mapped from.
> - */
> -struct vmemmap_remap_walk {
> -	void (*remap_pte)(pte_t *pte, unsigned long addr,
> -			  struct vmemmap_remap_walk *walk);
> -	unsigned long nr_walked;
> -	struct page *reuse_page;
> -	unsigned long reuse_addr;
> -	struct list_head *vmemmap_pages;
> -};
> -
> -static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> -{
> -	pmd_t __pmd;
> -	int i;
> -	unsigned long addr = start;
> -	struct page *page = pmd_page(*pmd);
> -	pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
> -
> -	if (!pgtable)
> -		return -ENOMEM;
> -
> -	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
> -
> -	for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
> -		pte_t entry, *pte;
> -		pgprot_t pgprot = PAGE_KERNEL;
> -
> -		entry = mk_pte(page + i, pgprot);
> -		pte = pte_offset_kernel(&__pmd, addr);
> -		set_pte_at(&init_mm, addr, pte, entry);
> -	}
> -
> -	spin_lock(&init_mm.page_table_lock);
> -	if (likely(pmd_leaf(*pmd))) {
> -		/* Make pte visible before pmd. See comment in pmd_install(). */
> -		smp_wmb();
> -		pmd_populate_kernel(&init_mm, pmd, pgtable);
> -		flush_tlb_kernel_range(start, start + PMD_SIZE);
> -	} else {
> -		pte_free_kernel(&init_mm, pgtable);
> -	}
> -	spin_unlock(&init_mm.page_table_lock);
> -
> -	return 0;
> -}
> -
> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> -{
> -	int leaf;
> -
> -	spin_lock(&init_mm.page_table_lock);
> -	leaf = pmd_leaf(*pmd);
> -	spin_unlock(&init_mm.page_table_lock);
> -
> -	if (!leaf)
> -		return 0;
> -
> -	return __split_vmemmap_huge_pmd(pmd, start);
> -}
> -
> -static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> -			      unsigned long end,
> -			      struct vmemmap_remap_walk *walk)
> -{
> -	pte_t *pte = pte_offset_kernel(pmd, addr);
> -
> -	/*
> -	 * The reuse_page is found 'first' in table walk before we start
> -	 * remapping (which is calling @walk->remap_pte).
> -	 */
> -	if (!walk->reuse_page) {
> -		walk->reuse_page = pte_page(*pte);
> -		/*
> -		 * Because the reuse address is part of the range that we are
> -		 * walking, skip the reuse address range.
> -		 */
> -		addr += PAGE_SIZE;
> -		pte++;
> -		walk->nr_walked++;
> -	}
> -
> -	for (; addr != end; addr += PAGE_SIZE, pte++) {
> -		walk->remap_pte(pte, addr, walk);
> -		walk->nr_walked++;
> -	}
> -}
> -
> -static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> -			     unsigned long end,
> -			     struct vmemmap_remap_walk *walk)
> -{
> -	pmd_t *pmd;
> -	unsigned long next;
> -
> -	pmd = pmd_offset(pud, addr);
> -	do {
> -		int ret;
> -
> -		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> -		if (ret)
> -			return ret;
> -
> -		next = pmd_addr_end(addr, end);
> -		vmemmap_pte_range(pmd, addr, next, walk);
> -	} while (pmd++, addr = next, addr != end);
> -
> -	return 0;
> -}
> -
> -static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
> -			     unsigned long end,
> -			     struct vmemmap_remap_walk *walk)
> -{
> -	pud_t *pud;
> -	unsigned long next;
> -
> -	pud = pud_offset(p4d, addr);
> -	do {
> -		int ret;
> -
> -		next = pud_addr_end(addr, end);
> -		ret = vmemmap_pmd_range(pud, addr, next, walk);
> -		if (ret)
> -			return ret;
> -	} while (pud++, addr = next, addr != end);
> -
> -	return 0;
> -}
> -
> -static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
> -			     unsigned long end,
> -			     struct vmemmap_remap_walk *walk)
> -{
> -	p4d_t *p4d;
> -	unsigned long next;
> -
> -	p4d = p4d_offset(pgd, addr);
> -	do {
> -		int ret;
> -
> -		next = p4d_addr_end(addr, end);
> -		ret = vmemmap_pud_range(p4d, addr, next, walk);
> -		if (ret)
> -			return ret;
> -	} while (p4d++, addr = next, addr != end);
> -
> -	return 0;
> -}
> -
> -static int vmemmap_remap_range(unsigned long start, unsigned long end,
> -			       struct vmemmap_remap_walk *walk)
> -{
> -	unsigned long addr = start;
> -	unsigned long next;
> -	pgd_t *pgd;
> -
> -	VM_BUG_ON(!PAGE_ALIGNED(start));
> -	VM_BUG_ON(!PAGE_ALIGNED(end));
> -
> -	pgd = pgd_offset_k(addr);
> -	do {
> -		int ret;
> -
> -		next = pgd_addr_end(addr, end);
> -		ret = vmemmap_p4d_range(pgd, addr, next, walk);
> -		if (ret)
> -			return ret;
> -	} while (pgd++, addr = next, addr != end);
> -
> -	/*
> -	 * We only change the mapping of the vmemmap virtual address range
> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> -	 * belongs to the range.
> -	 */
> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
> -
> -	return 0;
> -}
> -
> -/*
> - * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> - * allocator or buddy allocator. If the PG_reserved flag is set, it means
> - * that it allocated from the memblock allocator, just free it via the
> - * free_bootmem_page(). Otherwise, use __free_page().
> - */
> -static inline void free_vmemmap_page(struct page *page)
> -{
> -	if (PageReserved(page))
> -		free_bootmem_page(page);
> -	else
> -		__free_page(page);
> -}
> -
> -/* Free a list of the vmemmap pages */
> -static void free_vmemmap_page_list(struct list_head *list)
> -{
> -	struct page *page, *next;
> -
> -	list_for_each_entry_safe(page, next, list, lru) {
> -		list_del(&page->lru);
> -		free_vmemmap_page(page);
> -	}
> -}
> -
> -static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> -			      struct vmemmap_remap_walk *walk)
> -{
> -	/*
> -	 * Remap the tail pages as read-only to catch illegal write operation
> -	 * to the tail pages.
> -	 */
> -	pgprot_t pgprot = PAGE_KERNEL_RO;
> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> -	struct page *page = pte_page(*pte);
> -
> -	list_add_tail(&page->lru, walk->vmemmap_pages);
> -	set_pte_at(&init_mm, addr, pte, entry);
> -}
> -
> -/*
> - * How many struct page structs need to be reset. When we reuse the head
> - * struct page, the special metadata (e.g. page->flags or page->mapping)
> - * cannot copy to the tail struct page structs. The invalid value will be
> - * checked in the free_tail_pages_check(). In order to avoid the message
> - * of "corrupted mapping in tail page". We need to reset at least 3 (one
> - * head struct page struct and two tail struct page structs) struct page
> - * structs.
> - */
> -#define NR_RESET_STRUCT_PAGE		3
> -
> -static inline void reset_struct_pages(struct page *start)
> -{
> -	int i;
> -	struct page *from = start + NR_RESET_STRUCT_PAGE;
> -
> -	for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
> -		memcpy(start + i, from, sizeof(*from));
> -}
> -
> -static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> -				struct vmemmap_remap_walk *walk)
> -{
> -	pgprot_t pgprot = PAGE_KERNEL;
> -	struct page *page;
> -	void *to;
> -
> -	BUG_ON(pte_page(*pte) != walk->reuse_page);
> -
> -	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> -	list_del(&page->lru);
> -	to = page_to_virt(page);
> -	copy_page(to, (void *)walk->reuse_addr);
> -	reset_struct_pages(to);
> -
> -	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> -}
> -
> -/**
> - * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> - *			to the page which @reuse is mapped to, then free vmemmap
> - *			which the range are mapped to.
> - * @start:	start address of the vmemmap virtual address range that we want
> - *		to remap.
> - * @end:	end address of the vmemmap virtual address range that we want to
> - *		remap.
> - * @reuse:	reuse address.
> - *
> - * Return: %0 on success, negative error code otherwise.
> - */
> -int vmemmap_remap_free(unsigned long start, unsigned long end,
> -		       unsigned long reuse)
> -{
> -	int ret;
> -	LIST_HEAD(vmemmap_pages);
> -	struct vmemmap_remap_walk walk = {
> -		.remap_pte	= vmemmap_remap_pte,
> -		.reuse_addr	= reuse,
> -		.vmemmap_pages	= &vmemmap_pages,
> -	};
> -
> -	/*
> -	 * In order to make remapping routine most efficient for the huge pages,
> -	 * the routine of vmemmap page table walking has the following rules
> -	 * (see more details from the vmemmap_pte_range()):
> -	 *
> -	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> -	 *   should be continuous.
> -	 * - The @reuse address is part of the range [@reuse, @end) that we are
> -	 *   walking which is passed to vmemmap_remap_range().
> -	 * - The @reuse address is the first in the complete range.
> -	 *
> -	 * So we need to make sure that @start and @reuse meet the above rules.
> -	 */
> -	BUG_ON(start - reuse != PAGE_SIZE);
> -
> -	mmap_read_lock(&init_mm);
> -	ret = vmemmap_remap_range(reuse, end, &walk);
> -	if (ret && walk.nr_walked) {
> -		end = reuse + walk.nr_walked * PAGE_SIZE;
> -		/*
> -		 * vmemmap_pages contains pages from the previous
> -		 * vmemmap_remap_range call which failed.  These
> -		 * are pages which were removed from the vmemmap.
> -		 * They will be restored in the following call.
> -		 */
> -		walk = (struct vmemmap_remap_walk) {
> -			.remap_pte	= vmemmap_restore_pte,
> -			.reuse_addr	= reuse,
> -			.vmemmap_pages	= &vmemmap_pages,
> -		};
> -
> -		vmemmap_remap_range(reuse, end, &walk);
> -	}
> -	mmap_read_unlock(&init_mm);
> -
> -	free_vmemmap_page_list(&vmemmap_pages);
> -
> -	return ret;
> -}
> -
> -static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> -				   gfp_t gfp_mask, struct list_head *list)
> -{
> -	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> -	int nid = page_to_nid((struct page *)start);
> -	struct page *page, *next;
> -
> -	while (nr_pages--) {
> -		page = alloc_pages_node(nid, gfp_mask, 0);
> -		if (!page)
> -			goto out;
> -		list_add_tail(&page->lru, list);
> -	}
> -
> -	return 0;
> -out:
> -	list_for_each_entry_safe(page, next, list, lru)
> -		__free_pages(page, 0);
> -	return -ENOMEM;
> -}
> -
> -/**
> - * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> - *			 to the page which is from the @vmemmap_pages
> - *			 respectively.
> - * @start:	start address of the vmemmap virtual address range that we want
> - *		to remap.
> - * @end:	end address of the vmemmap virtual address range that we want to
> - *		remap.
> - * @reuse:	reuse address.
> - * @gfp_mask:	GFP flag for allocating vmemmap pages.
> - *
> - * Return: %0 on success, negative error code otherwise.
> - */
> -int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> -			unsigned long reuse, gfp_t gfp_mask)
> -{
> -	LIST_HEAD(vmemmap_pages);
> -	struct vmemmap_remap_walk walk = {
> -		.remap_pte	= vmemmap_restore_pte,
> -		.reuse_addr	= reuse,
> -		.vmemmap_pages	= &vmemmap_pages,
> -	};
> -
> -	/* See the comment in the vmemmap_remap_free(). */
> -	BUG_ON(start - reuse != PAGE_SIZE);
> -
> -	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
> -		return -ENOMEM;
> -
> -	mmap_read_lock(&init_mm);
> -	vmemmap_remap_range(reuse, end, &walk);
> -	mmap_read_unlock(&init_mm);
> -
> -	return 0;
> -}
> -#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
>  
>  /*
>   * Allocate a block of memory to be used to back the virtual memory map
> -- 
> 2.11.0
> 
> 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling
  2022-06-13  8:10   ` Oscar Salvador
@ 2022-06-13  8:24     ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-13  8:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 10:10:08AM +0200, Oscar Salvador wrote:
> On Mon, Jun 13, 2022 at 02:35:08PM +0800, Muchun Song wrote:
> > We hold another reference to hugetlb_optimize_vmemmap_key when
> > turning vmemmap_optimize_mode on, because we use a static_key to tell
> > memory_hotplug that memory_hotplug.memmap_on_memory should be
> > overridden.  However, this rule went away when we introduced
> > SECTION_CANNOT_OPTIMIZE_VMEMMAP.  Therefore, we could simplify
> > vmemmap_optimize_mode handling by not holding another reference
> > to hugetlb_optimize_vmemmap_key.
> > 
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> LGTM, and it looks way nicer, so
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>

Thanks for taking a look.
 
> One question below though
> 
> > -static enum vmemmap_optimize_mode vmemmap_optimize_mode =
> > +static bool vmemmap_optimize_enabled =
> >  	IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
> 
> So, by default vmemmap_optimize_enabled will be on if we have
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON, but we can always override that
> via cmdline, as below, right?
> 

Totally right. CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON only controls
whether the feature is enabled by default when the user does not specify "it
should be off" via the cmdline.
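
To put the precedence in one place, here is a condensed sketch of the hunks
quoted earlier in the thread (not the literal patch): the Kconfig option only
seeds the initial value, and both the cmdline and the sysctl end up writing
the very same boolean, so whichever runs last wins.

  /* default comes from Kconfig */
  static bool vmemmap_optimize_enabled =
          IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);

  /* "hugetlb_free_vmemmap=on|off" on the cmdline overrides that default */
  static int __init hugetlb_vmemmap_early_param(char *buf)
  {
          return kstrtobool(buf, &vmemmap_optimize_enabled);
  }
  early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);

  /* and the vm.hugetlb_optimize_vmemmap sysctl points proc_dobool() at the
   * same flag, so a later runtime write overrides both of the above */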

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-13  6:35 ` [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability Muchun Song
@ 2022-06-13  8:33   ` Oscar Salvador
  2022-06-13  9:01     ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2022-06-13  8:33 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:12PM +0800, Muchun Song wrote:
> -static __init int hugetlb_vmemmap_sysctls_init(void)
> +static int __init hugetlb_vmemmap_init(void)
>  {
> +	const struct hstate *h;
> +	bool optimizable = false;
> +
>  	/*
> -	 * If "struct page" crosses page boundaries, the vmemmap pages cannot
> -	 * be optimized.
> +	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
> +	 * page structs that can be used when HVO is enabled.
>  	 */
> -	if (is_power_of_2(sizeof(struct page)))
> -		register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
> +	BUILD_BUG_ON(__NR_USED_SUBPAGE >= RESERVE_VMEMMAP_SIZE / sizeof(struct page));

I need to take another look, but at first glance there is something
here that caught my eye.

> +
> +	for_each_hstate(h) {
> +		char buf[16];
> +		unsigned int size = 0;
> +
> +		if (hugetlb_vmemmap_optimizable(h))
> +			size = hugetlb_vmemmap_size(h) - RESERVE_VMEMMAP_SIZE;
> +		optimizable = size ? true : optimizable;

This feels weird, just use false instead of optimizable.

> +		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
> +				sizeof(buf));
> +		pr_info("%d KiB vmemmap can be optimized for a %s page\n",
> +			size / SZ_1K, buf);

I do not have a strong opinion but I wonder whether this brings a lot.


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-13  8:33   ` Oscar Salvador
@ 2022-06-13  9:01     ` Muchun Song
  2022-06-14  0:22       ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-13  9:01 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 10:33:48AM +0200, Oscar Salvador wrote:
> On Mon, Jun 13, 2022 at 02:35:12PM +0800, Muchun Song wrote:
> > -static __init int hugetlb_vmemmap_sysctls_init(void)
> > +static int __init hugetlb_vmemmap_init(void)
> >  {
> > +	const struct hstate *h;
> > +	bool optimizable = false;
> > +
> >  	/*
> > -	 * If "struct page" crosses page boundaries, the vmemmap pages cannot
> > -	 * be optimized.
> > +	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
> > +	 * page structs that can be used when HVO is enabled.
> >  	 */
> > -	if (is_power_of_2(sizeof(struct page)))
> > -		register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
> > +	BUILD_BUG_ON(__NR_USED_SUBPAGE >= RESERVE_VMEMMAP_SIZE / sizeof(struct page));
> 
> I need to take another look, but from the first glance there is something
> here that caught my eye.
>

Thanks for taking a look. This was introduced in commit f41f2ed43ca5.
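
For readers following along, a rough worked example of what that BUILD_BUG_ON
guards (4KB base pages and a 64-byte struct page are assumed here):
RESERVE_VMEMMAP_SIZE / sizeof(struct page) = 4096 / 64 = 64, i.e. after HVO
only the struct pages backed by the single reserved vmemmap page remain
individually writable.  __NR_USED_SUBPAGE, the highest subpage index HugeTLB
actually dereferences, must therefore stay below 64, and the build breaks if
it ever grows past that.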
 
> > +
> > +	for_each_hstate(h) {
> > +		char buf[16];
> > +		unsigned int size = 0;
> > +
> > +		if (hugetlb_vmemmap_optimizable(h))
> > +			size = hugetlb_vmemmap_size(h) - RESERVE_VMEMMAP_SIZE;
> > +		optimizable = size ? true : optimizable;
> 
> This feels weird, just use false instead of optimizable.
>

This is a loop; we should keep "optimizable" set to "true" as long as at least
one hstate is optimizable. How about:

  if (size)
	optimizable = true;
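
i.e. the loop would then read something like this (an untested sketch of the
suggested variant, based on the hunk quoted earlier in this thread):

  for_each_hstate(h) {
          char buf[16];
          unsigned int size = 0;

          if (hugetlb_vmemmap_optimizable(h))
                  size = hugetlb_vmemmap_size(h) - RESERVE_VMEMMAP_SIZE;
          if (size)
                  optimizable = true;
          string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
                          sizeof(buf));
          pr_info("%d KiB vmemmap can be optimized for a %s page\n",
                  size / SZ_1K, buf);
  }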

> > +		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
> > +				sizeof(buf));
> > +		pr_info("%d KiB vmemmap can be optimized for a %s page\n",
> > +			size / SZ_1K, buf);
> 
> I do not have a strong opinion but I wonder whether this brings a lot.
>

I thought users could learn which HugeTLB sizes are optimizable from
this log.  E.g. on aarch64, a 64KB HugeTLB page cannot be optimized.
I do not have a strong opinion either; if anyone thinks it is
unnecessary, I'll drop it in the next version.
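
For completeness, the back-of-the-envelope arithmetic for that aarch64 case,
as a standalone sketch (4KB base pages and a 64-byte struct page are assumed;
RESERVE_VMEMMAP_SIZE is one base page as in this series):

  #include <stdio.h>

  int main(void)
  {
          unsigned long base_page = 4096;       /* assumed PAGE_SIZE */
          unsigned long struct_page = 64;       /* assumed sizeof(struct page) */
          unsigned long huge_page = 64 * 1024;  /* 64KB HugeTLB page */
          unsigned long vmemmap = (huge_page / base_page) * struct_page;

          /* mirrors hugetlb_vmemmap_optimizable(): only worth it if the
           * vmemmap spans more than the single reserved page */
          printf("vmemmap = %lu bytes, optimizable = %d\n",
                 vmemmap, vmemmap > base_page); /* 1024 bytes, 0 */
          return 0;
  }

So a 64KB page's vmemmap fits entirely inside the one reserved page, nothing
is left to free, and that is exactly the "0 KiB" line the boot log would show
for it.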

Thanks.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
  2022-06-13  8:13   ` Oscar Salvador
@ 2022-06-13 15:39   ` David Hildenbrand
  2022-06-14  3:15     ` Muchun Song
  2022-06-13 21:19   ` Mike Kravetz
  2022-06-15 14:51   ` Joao Martins
  3 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2022-06-13 15:39 UTC (permalink / raw)
  To: Muchun Song, mike.kravetz, akpm, corbet; +Cc: linux-mm, linux-kernel, linux-doc

On 13.06.22 08:35, Muchun Song wrote:
> It is inconvenient to mention the feature of optimizing vmemmap pages associated
> with HugeTLB pages when communicating with others since there is no specific or
> abbreviated name for it when it is first introduced.  Let us give it the name HVO
> (HugeTLB Vmemmap Optimization) from now on.
> 
> This commit also updates the document about "hugetlb_free_vmemmap" by the way
> discussed in thread [1].
> 
> Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  7 ++++---
>  Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +--
>  Documentation/admin-guide/sysctl/vm.rst         |  3 +--
>  fs/Kconfig                                      | 13 ++++++-------
>  mm/hugetlb_vmemmap.c                            |  8 ++++----
>  mm/hugetlb_vmemmap.h                            |  4 ++--
>  6 files changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 391b43fee93e..7539553b3fb0 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1725,12 +1725,13 @@
>  	hugetlb_free_vmemmap=
>  			[KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  			enabled.
> +			Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
>  			Allows heavy hugetlb users to free up some more
>  			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
> -			Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
> +			Format: { on | off (default) }
>  
> -			[oO][Nn]/Y/y/1: enable the feature
> -			[oO][Ff]/N/n/0: disable the feature
> +			on: enable HVO
> +			off: disable HVO
>  
>  			Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
>  			the default is on.
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index a90330d0a837..64e0d5c512e7 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -164,8 +164,7 @@ default_hugepagesz
>  	will all result in 256 2M huge pages being allocated.  Valid default
>  	huge page size is architecture dependent.
>  hugetlb_free_vmemmap
> -	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
> -	unused vmemmap pages associated with each HugeTLB page.
> +	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HVO.

Heh, it would be convenient to call this

CONFIG_HUGETLB_PAGE_VMEMMAP_OPTIMIZATION (HVO) then.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()
  2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
  2022-06-13  8:04   ` Oscar Salvador
@ 2022-06-13 18:15   ` Mike kravetz
  2022-06-20 16:57   ` Catalin Marinas
  2 siblings, 0 replies; 31+ messages in thread
From: Mike kravetz @ 2022-06-13 18:15 UTC (permalink / raw)
  To: Muchun Song
  Cc: david, akpm, corbet, linux-mm, linux-kernel, linux-doc,
	Catalin Marinas, Will Deacon, Anshuman Khandual

On Mon, Jun 13, 2022 at 02:35:07PM +0800, Muchun Song wrote:
> The name hugetlb_optimize_vmemmap_enabled() a bit confusing as it tests
> two conditions (enabled and pages in use).  Instead of coming up to
> an appropriate name, we could just delete it.  There is already a
> discussion about deleting it in thread [1].
> 
> There is only one user of hugetlb_optimize_vmemmap_enabled() outside of
> hugetlb_vmemmap, that is flush_dcache_page() in arch/arm64/mm/flush.c.
> However, it does not need to call hugetlb_optimize_vmemmap_enabled()
> in flush_dcache_page() since HugeTLB pages are always fully mapped and
> only head page will be set PG_dcache_clean meaning only head page's flag
> may need to be cleared (see commit cf5a501d985b).  So it is easy to
> remove hugetlb_optimize_vmemmap_enabled().
> 
> Link: https://lore.kernel.org/all/c77c61c8-8a5a-87e8-db89-d04d8aaab4cc@oracle.com/ [1]
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/mm/flush.c      | 13 +++----------
>  include/linux/page-flags.h | 14 ++------------
>  2 files changed, 5 insertions(+), 22 deletions(-)

Thanks.  I think this makes the code less confusing.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling
  2022-06-13  6:35 ` [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling Muchun Song
  2022-06-13  8:10   ` Oscar Salvador
@ 2022-06-13 18:28   ` Mike kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Mike kravetz @ 2022-06-13 18:28 UTC (permalink / raw)
  To: Muchun Song; +Cc: david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:08PM +0800, Muchun Song wrote:
> We hold another reference to hugetlb_optimize_vmemmap_key when
> turning vmemmap_optimize_mode on, because we use a static_key to tell
> memory_hotplug that memory_hotplug.memmap_on_memory should be
> overridden.  However, this rule went away when we introduced
> SECTION_CANNOT_OPTIMIZE_VMEMMAP.  Therefore, we could simplify
> vmemmap_optimize_mode handling by not holding another reference
> to hugetlb_optimize_vmemmap_key.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  include/linux/page-flags.h |  6 ++---
>  mm/hugetlb_vmemmap.c       | 65 +++++-----------------------------------------
>  2 files changed, 9 insertions(+), 62 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b8b992cb201c..da7ccc3b16ad 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -200,8 +200,7 @@ enum pageflags {
>  #ifndef __GENERATING_BOUNDS_H
>  
>  #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> -DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
> -			 hugetlb_optimize_vmemmap_key);
> +DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
>  
>  /*
>   * If the feature of optimizing vmemmap pages associated with each HugeTLB
> @@ -221,8 +220,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
>   */
>  static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
>  {
> -	if (!static_branch_maybe(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
> -				 &hugetlb_optimize_vmemmap_key))
> +	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
>  		return page;

This also means that we do not incur the extra page_fixed_fake_head() checks
if there are no vmemmap-optimized hugetlb pages.  Nice!
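
To make that concrete, a rough sketch (helper names assumed, not taken from
the series) of how the key is expected to track usage: it is bumped when a
HugeTLB page's vmemmap actually gets optimized and dropped when the vmemmap
is restored, so with DECLARE_STATIC_KEY_FALSE the branch above stays patched
out whenever no optimized pages exist.

/* Sketch only; the real callers live in mm/hugetlb_vmemmap.c. */
static void hugetlb_vmemmap_key_get(void)
{
	/* called once per huge page whose vmemmap pages were freed */
	static_branch_inc(&hugetlb_optimize_vmemmap_key);
}

static void hugetlb_vmemmap_key_put(void)
{
	/* called when the vmemmap pages of a huge page are restored */
	static_branch_dec(&hugetlb_optimize_vmemmap_key);
}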

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
  2022-06-13  8:13   ` Oscar Salvador
  2022-06-13 15:39   ` David Hildenbrand
@ 2022-06-13 21:19   ` Mike Kravetz
  2022-06-14  3:17     ` Muchun Song
  2022-06-15 14:51   ` Joao Martins
  3 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2022-06-13 21:19 UTC (permalink / raw)
  To: Muchun Song; +Cc: david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:09PM +0800, Muchun Song wrote:
> It is inconvenient to mention the feature of optimizing vmemmap pages associated
> with HugeTLB pages when communicating with others since there is no specific or
> abbreviated name for it when it is first introduced.  Let us give it a name HVO
> (HugeTLB Vmemmap Optimization) from now on.
> 
> This commit also updates the document about "hugetlb_free_vmemmap" in the way
> discussed in thread [1].
> 
> Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---

> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index a90330d0a837..64e0d5c512e7 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -164,8 +164,7 @@ default_hugepagesz
>  	will all result in 256 2M huge pages being allocated.  Valid default
>  	huge page size is architecture dependent.
>  hugetlb_free_vmemmap
> -	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
> -	unused vmemmap pages associated with each HugeTLB page.
> +	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HVO.

I think we need to define HVO before using it here.  Readers may be able
to determine the meaning, but to be sure I would suggest:

When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set this enables
HugeTLB Vmemmap Optimization (HVO).

Everything else looks good to me.
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  2022-06-13  6:35 ` [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c Muchun Song
  2022-06-13  8:15   ` Oscar Salvador
@ 2022-06-13 21:34   ` Mike Kravetz
  1 sibling, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2022-06-13 21:34 UTC (permalink / raw)
  To: Muchun Song; +Cc: david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:10PM +0800, Muchun Song wrote:
> When I first introduced vmemmap manipulation functions related to HugeTLB,
> I thought those functions might be reused by other modules (e.g. using a
> similar approach to optimize vmemmap pages; unfortunately, DAX used the
> same approach but does not use those functions).  After two years, we didn't
> see any other users.  So move those functions to hugetlb_vmemmap.c.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  include/linux/mm.h   |   7 -
>  mm/hugetlb_vmemmap.c | 391 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  mm/sparse-vmemmap.c  | 391 ---------------------------------------------------
>  3 files changed, 390 insertions(+), 399 deletions(-)

Code movement without any functional change.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] mm: hugetlb_vmemmap: replace early_param() with core_param()
  2022-06-13  6:35 ` [PATCH 5/6] mm: hugetlb_vmemmap: replace early_param() with core_param() Muchun Song
@ 2022-06-13 21:43   ` Mike Kravetz
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2022-06-13 21:43 UTC (permalink / raw)
  To: Muchun Song; +Cc: david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:35:11PM +0800, Muchun Song wrote:
> After the following commit and the previous commit are applied:
> 
>   78f39084b41d ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl")
> 
> There is no ordering requirement between the "hugetlb_free_vmemmap" and
> "hugepages" parameters since we have removed the check of whether HVO is enabled
> from hugetlb_vmemmap_init(). Therefore we can safely replace early_param()
> with core_param() to simplify the code.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb_vmemmap.c | 10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
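
For readers less familiar with the two macros, the shape of the change is
roughly as follows (a sketch; the variable name is assumed rather than taken
from the patch).  early_param() needs a parsing callback that runs during
early boot and can therefore interact with the ordering of other early setup,
while core_param() simply binds the boot parameter to a variable:

static bool vmemmap_optimize_enabled =
	IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);

/* Before: a callback parsed "hugetlb_free_vmemmap=" at early boot. */
static int __init hugetlb_vmemmap_early_param(char *buf)
{
	return kstrtobool(buf, &vmemmap_optimize_enabled);
}
early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);

/* After: no callback and no ordering requirement against "hugepages=". */
core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);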

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-13  9:01     ` Muchun Song
@ 2022-06-14  0:22       ` Mike Kravetz
  2022-06-14  4:17         ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2022-06-14  0:22 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 05:01:43PM +0800, Muchun Song wrote:
> On Mon, Jun 13, 2022 at 10:33:48AM +0200, Oscar Salvador wrote:
> > On Mon, Jun 13, 2022 at 02:35:12PM +0800, Muchun Song wrote:
> > > -static __init int hugetlb_vmemmap_sysctls_init(void)
> > > +static int __init hugetlb_vmemmap_init(void)
> > >  {
> > > +	const struct hstate *h;
> > > +	bool optimizable = false;
> > > +
> > >  	/*
> > > -	 * If "struct page" crosses page boundaries, the vmemmap pages cannot
> > > -	 * be optimized.
> > > +	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
> > > +	 * page structs that can be used when HVO is enabled.
> > >  	 */
> > > -	if (is_power_of_2(sizeof(struct page)))
> > > -		register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
> > > +	BUILD_BUG_ON(__NR_USED_SUBPAGE >= RESERVE_VMEMMAP_SIZE / sizeof(struct page));
> > 
> > I need to take another look, but from the first glance there is something
> > here that caught my eye.
> >
> 
> Thanks for taking a look. This is introduced in commit f41f2ed43ca5.
>  
> > > +
> > > +	for_each_hstate(h) {
> > > +		char buf[16];
> > > +		unsigned int size = 0;
> > > +
> > > +		if (hugetlb_vmemmap_optimizable(h))
> > > +			size = hugetlb_vmemmap_size(h) - RESERVE_VMEMMAP_SIZE;
> > > +		optimizable = size ? true : optimizable;
> > 
> > This feels weird, just use false instead of optimizable.
> >
> 
> This is a loop; we should keep "optimizable" as "true" as long as at least one
> hstate is optimizable. How about:
> 
>   if (size)
> 	optimizable = true;
> 
> > > +		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
> > > +				sizeof(buf));
> > > +		pr_info("%d KiB vmemmap can be optimized for a %s page\n",
> > > +			size / SZ_1K, buf);
> > 
> > I do not have a strong opinion but I wonder whether this brings a lot.
> >
> 
> I thought users could know what size HugeTLB is optimizable via
> this log.  E.g. on aarch64, a 64KB HugeTLB page cannot be optimized.
> I do not have a strong opinion either; if anyone thinks it is
> unnecessary, I'll drop it in the next version.

I do not have a strong opinion.  I think it adds a little information.  For me,
the new logging of number of pages vmemmap optimized at boot seems a bit
redundant.  Here is a BEFORE/AFTER comparison.

BEFORE
------
[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
...
[    0.330930] HugeTLB: can optimize 4095 vmemmap pages for hugepages-1048576kB
[    0.350450] HugeTLB: can optimize 7 vmemmap pages for hugepages-2048kB
[    0.359282] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.359285] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages

AFTER
-----
[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
...
[    0.409068] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.409071] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
[    1.246107] HugeTLB: 16380 KiB vmemmap can be optimized for a 1.00 GiB page
[    1.246110] HugeTLB: 28 KiB vmemmap can be optimized for a 2.00 MiB page
[    1.246123] HugeTLB: 512 huge pages whose vmemmap are optimized at boot

When I read those messages, I am not sure if 'optimized' is the best
word to use.  I know that using alloc/free throughout the code was
confusing.  But, wouldn't it perhaps be more clear to the end user if
the messages read?

HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page

Also, how about having report_hugepages() call a routine that prints the
vmemmap savings.  Then the output could look something like:

HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
	 16380 KiB vmemmap can be freed for a 1.00 GiB page
HugeTLB: registered 2.00 MiB page size, pre-allocated 512 pages
	 28 KiB vmemmap can be freed for a 2.00 MiB page

Not insisting on these changes.  Just wanted to share the ideas.
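
As an aside, the 28 KiB and 16380 KiB figures in the log above fall straight
out of the vmemmap arithmetic.  A quick userspace sketch of that arithmetic,
assuming the common x86-64 values of 4 KiB base pages and a 64-byte struct
page (assumptions for illustration, not numbers taken from the patch):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;		/* base page size */
	unsigned long struct_page = 64;		/* sizeof(struct page) */

	/* vmemmap needed for one 2 MiB and one 1 GiB huge page */
	unsigned long vmemmap_2m = (2UL << 20) / page_size * struct_page;
	unsigned long vmemmap_1g = (1UL << 30) / page_size * struct_page;

	/* HVO keeps a single vmemmap page per huge page and frees the rest */
	printf("%lu KiB freed per 2 MiB page\n", (vmemmap_2m - page_size) >> 10);
	printf("%lu KiB freed per 1 GiB page\n", (vmemmap_1g - page_size) >> 10);
	return 0;	/* prints 28 and 16380 */
}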


Overall, the code improvements look good.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13 15:39   ` David Hildenbrand
@ 2022-06-14  3:15     ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-14  3:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: mike.kravetz, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 05:39:59PM +0200, David Hildenbrand wrote:
> On 13.06.22 08:35, Muchun Song wrote:
> > It is inconvenient to mention the feature of optimizing vmemmap pages associated
> > with HugeTLB pages when communicating with others since there is no specific or
> > abbreviated name for it when it is first introduced.  Let us give it a name HVO
> > (HugeTLB Vmemmap Optimization) from now on.
> > 
> > This commit also updates the document about "hugetlb_free_vmemmap" in the way
> > discussed in thread [1].
> > 
> > Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  Documentation/admin-guide/kernel-parameters.txt |  7 ++++---
> >  Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +--
> >  Documentation/admin-guide/sysctl/vm.rst         |  3 +--
> >  fs/Kconfig                                      | 13 ++++++-------
> >  mm/hugetlb_vmemmap.c                            |  8 ++++----
> >  mm/hugetlb_vmemmap.h                            |  4 ++--
> >  6 files changed, 18 insertions(+), 20 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 391b43fee93e..7539553b3fb0 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1725,12 +1725,13 @@
> >  	hugetlb_free_vmemmap=
> >  			[KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >  			enabled.
> > +			Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
> >  			Allows heavy hugetlb users to free up some more
> >  			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
> > -			Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
> > +			Format: { on | off (default) }
> >  
> > -			[oO][Nn]/Y/y/1: enable the feature
> > -			[oO][Ff]/N/n/0: disable the feature
> > +			on: enable HVO
> > +			off: disable HVO
> >  
> >  			Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
> >  			the default is on.
> > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> > index a90330d0a837..64e0d5c512e7 100644
> > --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> > @@ -164,8 +164,7 @@ default_hugepagesz
> >  	will all result in 256 2M huge pages being allocated.  Valid default
> >  	huge page size is architecture dependent.
> >  hugetlb_free_vmemmap
> > -	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
> > -	unused vmemmap pages associated with each HugeTLB page.
> > +	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HVO.
> 
> Heh, it would be convenient to call this
> 
> CONFIG_HUGETLB_PAGE_VMEMMAP_OPTIMIZATION (HVO) then.
>

Thanks for pointing it out. I will take Mike's suggestion and change it to:

  When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set this enables
  HugeTLB Vmemmap Optimization (HVO).

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13 21:19   ` Mike Kravetz
@ 2022-06-14  3:17     ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-14  3:17 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 02:19:45PM -0700, Mike Kravetz wrote:
> On Mon, Jun 13, 2022 at 02:35:09PM +0800, Muchun Song wrote:
> > It is inconvenient to mention the feature of optimizing vmemmap pages associated
> > with HugeTLB pages when communicating with others since there is no specific or
> > abbreviated name for it when it is first introduced.  Let us give it a name HVO
> > (HugeTLB Vmemmap Optimization) from now on.
> > 
> > This commit also updates the document about "hugetlb_free_vmemmap" in the way
> > discussed in thread [1].
> > 
> > Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> 
> > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> > index a90330d0a837..64e0d5c512e7 100644
> > --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> > @@ -164,8 +164,7 @@ default_hugepagesz
> >  	will all result in 256 2M huge pages being allocated.  Valid default
> >  	huge page size is architecture dependent.
> >  hugetlb_free_vmemmap
> > -	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
> > -	unused vmemmap pages associated with each HugeTLB page.
> > +	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HVO.
> 
> I think we need to define HVO before using it here.  Readers may be able
> to determine the meaning, but to be sure I would suggest:
>

Agree.
 
> When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set this enables
> HugeTLB Vmemmap Optimization (HVO).
> 

I would use this. Thanks.

> Everything else looks good to me.
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks for your review.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-14  0:22       ` Mike Kravetz
@ 2022-06-14  4:17         ` Muchun Song
  2022-06-14 16:57           ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-14  4:17 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Mon, Jun 13, 2022 at 05:22:30PM -0700, Mike Kravetz wrote:
> On Mon, Jun 13, 2022 at 05:01:43PM +0800, Muchun Song wrote:
> > On Mon, Jun 13, 2022 at 10:33:48AM +0200, Oscar Salvador wrote:
> > > On Mon, Jun 13, 2022 at 02:35:12PM +0800, Muchun Song wrote:
> > > > -static __init int hugetlb_vmemmap_sysctls_init(void)
> > > > +static int __init hugetlb_vmemmap_init(void)
> > > >  {
> > > > +	const struct hstate *h;
> > > > +	bool optimizable = false;
> > > > +
> > > >  	/*
> > > > -	 * If "struct page" crosses page boundaries, the vmemmap pages cannot
> > > > -	 * be optimized.
> > > > +	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
> > > > +	 * page structs that can be used when HVO is enabled.
> > > >  	 */
> > > > -	if (is_power_of_2(sizeof(struct page)))
> > > > -		register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
> > > > +	BUILD_BUG_ON(__NR_USED_SUBPAGE >= RESERVE_VMEMMAP_SIZE / sizeof(struct page));
> > > 
> > > I need to take another look, but from the first glance there is something
> > > here that caught my eye.
> > >
> > 
> > Thanks for taking a look. This is introduced in commit f41f2ed43ca5.
> >  
> > > > +
> > > > +	for_each_hstate(h) {
> > > > +		char buf[16];
> > > > +		unsigned int size = 0;
> > > > +
> > > > +		if (hugetlb_vmemmap_optimizable(h))
> > > > +			size = hugetlb_vmemmap_size(h) - RESERVE_VMEMMAP_SIZE;
> > > > +		optimizable = size ? true : optimizable;
> > > 
> > > This feels weird, just use false instead of optimizable.
> > >
> > 
> > This is a loop; we should keep "optimizable" as "true" as long as at least one
> > hstate is optimizable. How about:
> > 
> >   if (size)
> > 	optimizable = true;
> > 
> > > > +		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
> > > > +				sizeof(buf));
> > > > +		pr_info("%d KiB vmemmap can be optimized for a %s page\n",
> > > > +			size / SZ_1K, buf);
> > > 
> > > I do not have a strong opinion but I wonder whether this brings a lot.
> > >
> > 
> > I thought users could know what size HugeTLB is optimizable via
> > this log.  E.g. on aarch64, a 64KB HugeTLB page cannot be optimized.
> > I do not have a strong opinion either; if anyone thinks it is
> > unnecessary, I'll drop it in the next version.
> 
> I do not have a strong opinion.  I think it adds a little information.  For me,
> the new logging of number of pages vmemmap optimized at boot seems a bit
> redundant.  Here is a BEFORE/AFTER comparison.
>

Well, I'll drop the "new logging".
 
> BEFORE
> ------
> [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
> ...
> [    0.330930] HugeTLB: can optimize 4095 vmemmap pages for hugepages-1048576kB
> [    0.350450] HugeTLB: can optimize 7 vmemmap pages for hugepages-2048kB
> [    0.359282] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [    0.359285] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
> 
> AFTER
> -----
> [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
> ...
> [    0.409068] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [    0.409071] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
> [    1.246107] HugeTLB: 16380 KiB vmemmap can be optimized for a 1.00 GiB page
> [    1.246110] HugeTLB: 28 KiB vmemmap can be optimized for a 2.00 MiB page
> [    1.246123] HugeTLB: 512 huge pages whose vmemmap are optimized at boot
> 
> When I read those messages, I am not sure if 'optimized' is the best
> word to use.  I know that using alloc/free throughout the code was
> confusing.  But, wouldn't it perhaps be more clear to the end user if
> the messages read?

Well, I agree with you at least. "free" may be more friendly to the end
users.  I'll change the word "optimized" to "freed".

> 
> HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
> 
> Also, how about having report_hugepages() call a routine that prints the
> vmemmmap savings.  Then output could then look something like:
> 
> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> 	 16380 KiB vmemmap can be freed for a 1.00 GiB page
> HugeTLB: registered 2.00 MiB page size, pre-allocated 512 pages
> 	 28 KiB vmemmap can be free for a 2.00 MiB page
>

Well, we eliminate the prefix "HugeTLB:" for memory saving log.
Maybe it is not a good choice since it is not easy to grep the log
(e.g. dmesg | grep "HugeTLB" will not show saving log).  If
we combine both 2-line log into one line, the log becomes a bit long.
So I'd like to not eliminate the prefix but gather this 2-line log
into one place. I mean "HugeTLB: registered 1.00 GiB page size,
pre-allocated 0 pages" is just followed by "HugeTLB: 28 KiB vmemmap
can be freed for a 2.00 MiB page" without any log insertion in
between. But I have no strong opinion on doing this; I'd like to listen
to your opinion before making a decision on those changes.

Thanks.

> Not insisting on these changes.  Just wanted to share the ideas.
> 
> 
> Overall, the code improvements look good.
> -- 
> Mike Kravetz
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-14  4:17         ` Muchun Song
@ 2022-06-14 16:57           ` Mike Kravetz
  2022-06-15 12:33             ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2022-06-14 16:57 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Tue, Jun 14, 2022 at 12:17:21PM +0800, Muchun Song wrote:
> On Mon, Jun 13, 2022 at 05:22:30PM -0700, Mike Kravetz wrote:
> > On Mon, Jun 13, 2022 at 05:01:43PM +0800, Muchun Song wrote:
> > > On Mon, Jun 13, 2022 at 10:33:48AM +0200, Oscar Salvador wrote:
> > > > On Mon, Jun 13, 2022 at 02:35:12PM +0800, Muchun Song wrote:
>  
> > BEFORE
> > ------
> > [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
> > ...
> > [    0.330930] HugeTLB: can optimize 4095 vmemmap pages for hugepages-1048576kB
> > [    0.350450] HugeTLB: can optimize 7 vmemmap pages for hugepages-2048kB
> > [    0.359282] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> > [    0.359285] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
> > 
> > AFTER
> > -----
> > [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
> > ...
> > [    0.409068] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> > [    0.409071] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
> > [    1.246107] HugeTLB: 16380 KiB vmemmap can be optimized for a 1.00 GiB page
> > [    1.246110] HugeTLB: 28 KiB vmemmap can be optimized for a 2.00 MiB page
> > [    1.246123] HugeTLB: 512 huge pages whose vmemmap are optimized at boot
> > 
> > When I read those messages, I am not sure if 'optimized' is the best
> > word to use.  I know that using alloc/free throughout the code was
> > confusing.  But, wouldn't it perhaps be more clear to the end user if
> > the messages read?
> 
> Well, I agree with you at least. "free" may be more friendly to the end
> users.  I'll change the word "optimized" to "freed".
> 
> > 
> > HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
> > 
> > Also, how about having report_hugepages() call a routine that prints the
> > vmemmap savings.  Then the output could look something like:
> > 
> > HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> > 	 16380 KiB vmemmap can be freed for a 1.00 GiB page
> > HugeTLB: registered 2.00 MiB page size, pre-allocated 512 pages
> > 	 28 KiB vmemmap can be freed for a 2.00 MiB page
> >
> 
> Well, we eliminate the prefix "HugeTLB:" for memory saving log.
> Maybe it is not a good choice since it is not easy to grep the log
> (e.g. dmesg | grep "HugeTLB" will not show saving log).  If

Agree, we should not drop the HugeTLB prefix from the message.

> we combine both 2-line log into one line, the log becomes a bit long.
> So I'd like to not eliminate the prefix but gather this 2-line log
> into one place. I mean "HugeTLB: registered 1.00 GiB page size,
> pre-allocated 0 pages" is just followed by "HugeTLB: 28 KiB vmemmap
> can be freed for a 2.00 MiB page" without any log insertion in
> between. But I have no strong opinion on doing this; I'd like to listen
> to your opinion before making a decision on those changes.

I think we are talking about the same thing.  Did you make a mistake when
saying the amount of vmemmap freed for a 2MB page followed the 1GB hstate size?

My thought was that it would be good to log the vmemmap savings near the
message that notes the hstate/huge page size and any preallocation.
Just my opinion.  So messages would look like:

HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
HugeTLB: registered 2.00 MiB page size, pre-allocated 512 pages
HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
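
One way that could look in code, as a rough sketch (the vmemmap-size helper
name is assumed here, not taken from this series):

static void __init report_hugepages(void)
{
	struct hstate *h;

	for_each_hstate(h) {
		char buf[32];
		unsigned long freed = hugetlb_vmemmap_optimizable_size(h);

		string_get_size(huge_page_size(h), 1, STRING_UNITS_2,
				buf, sizeof(buf));
		pr_info("HugeTLB: registered %s page size, pre-allocated %lu pages\n",
			buf, h->free_huge_pages);
		pr_info("HugeTLB: %lu KiB vmemmap can be freed for a %s page\n",
			freed / SZ_1K, buf);
	}
}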

I really do not have strong opinions on this.  However, if we are going
to change the messages we should try to make them as useful and clear as
possible.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  2022-06-14 16:57           ` Mike Kravetz
@ 2022-06-15 12:33             ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-15 12:33 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, david, akpm, corbet, linux-mm, linux-kernel, linux-doc

On Tue, Jun 14, 2022 at 09:57:35AM -0700, Mike Kravetz wrote:
> On Tue, Jun 14, 2022 at 12:17:21PM +0800, Muchun Song wrote:
> > On Mon, Jun 13, 2022 at 05:22:30PM -0700, Mike Kravetz wrote:
> > > On Mon, Jun 13, 2022 at 05:01:43PM +0800, Muchun Song wrote:
> > > > On Mon, Jun 13, 2022 at 10:33:48AM +0200, Oscar Salvador wrote:
> > > > > On Mon, Jun 13, 2022 at 02:35:12PM +0800, Muchun Song wrote:
> >  
> > > BEFORE
> > > ------
> > > [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
> > > ...
> > > [    0.330930] HugeTLB: can optimize 4095 vmemmap pages for hugepages-1048576kB
> > > [    0.350450] HugeTLB: can optimize 7 vmemmap pages for hugepages-2048kB
> > > [    0.359282] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> > > [    0.359285] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
> > > 
> > > AFTER
> > > -----
> > > [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.19.0-rc1-next-20220610+ root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepages=512
> > > ...
> > > [    0.409068] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> > > [    0.409071] HugeTLB registered 2.00 MiB page size, pre-allocated 512 pages
> > > [    1.246107] HugeTLB: 16380 KiB vmemmap can be optimized for a 1.00 GiB page
> > > [    1.246110] HugeTLB: 28 KiB vmemmap can be optimized for a 2.00 MiB page
> > > [    1.246123] HugeTLB: 512 huge pages whose vmemmap are optimized at boot
> > > 
> > > When I read those messages, I am not sure if 'optimized' is the best
> > > word to use.  I know that using alloc/free throughout the code was
> > > confusing.  But, wouldn't it perhaps be more clear to the end user if
> > > the messages read?
> > 
> > Well, I agree with you at least. "free" may be more friendly to the end
> > users.  I'll change the word "optimized" to "freed".
> > 
> > > 
> > > HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
> > > 
> > > Also, how about having report_hugepages() call a routine that prints the
> > > vmemmap savings.  Then the output could look something like:
> > > 
> > > HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> > > 	 16380 KiB vmemmap can be freed for a 1.00 GiB page
> > > HugeTLB: registered 2.00 MiB page size, pre-allocated 512 pages
> > > 	 28 KiB vmemmap can be freed for a 2.00 MiB page
> > >
> > 
> > Well, we eliminate the prefix "HugeTLB:" for memory saving log.
> > Maybe it is not a good choice since it is not easy to grep the log
> > (e.g. dmesg | grep "HugeTLB" will not show saving log).  If
> 
> Agree, we should not drop the HugeTLB prefix from the message.
> 
> > we combine both 2-line log into one line, the log becomes a bit long.
> > So I'd like to not eliminate the prefix but gather this 2-line log
> > into one place. I mean "HugeTLB: registered 1.00 GiB page size,
> > pre-allocated 0 pages" is just followed by "HugeTLB: 28 KiB vmemmap
> > can be freed for a 2.00 MiB page" without any log insertion in
> > between. But I have no strong opinion on doing this; I'd like to listen
> > to your opinion before making a decision on those changes.
> 
> I think we are talking about the same thing.  Did you make a mistake when
> saying the amount of vmemmap freed for a 2MB page followed the 1GB hstate size?
> 
> My thought was that it would be good to log the vmemmap savings near the
> message that notes the hstate/huge page size and any preallocation.
> Just my opinion.  So messages would look like:
> 
> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
> HugeTLB: registered 2.00 MiB page size, pre-allocated 512 pages
> HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
>

I am sure we are on the same page. This looks good to me.

> I really do not have strong opinions on this.  However, if we are going
> to change the messages we should try to make them as useful and clear as
> possible.
> 

Makes sense. I'll do it in v2.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
                     ` (2 preceding siblings ...)
  2022-06-13 21:19   ` Mike Kravetz
@ 2022-06-15 14:51   ` Joao Martins
  2022-06-16  3:28     ` Muchun Song
  3 siblings, 1 reply; 31+ messages in thread
From: Joao Martins @ 2022-06-15 14:51 UTC (permalink / raw)
  To: Muchun Song, mike.kravetz
  Cc: linux-mm, linux-kernel, linux-doc, akpm, corbet, david

On 6/13/22 07:35, Muchun Song wrote:
> It is inconvenient to mention the feature of optimizing vmemmap pages associated
> with HugeTLB pages when communicating with others since there is no specific or
> abbreviated name for it when it is first introduced.  Let us give it a name HVO
> (HugeTLB Vmemmap Optimization) from now on.
> 

Just thought I would throw this suggestion, even though I am probably too late.

I find the term "vmemmap deduplication" more self-explanatory (at least for me)
to refer to your technique, and similarly s/optimize/dedup. Or vmemmap tail page
deduplication (too verbose maybe) because really that's what this optimization is all
about. OTOH it would slightly deviate from what may be established now
in hugetlb code.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-15 14:51   ` Joao Martins
@ 2022-06-16  3:28     ` Muchun Song
  2022-06-16 22:27       ` Mike Kravetz
  0 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-16  3:28 UTC (permalink / raw)
  To: Joao Martins, mike.kravetz, david
  Cc: linux-mm, linux-kernel, linux-doc, akpm, corbet

On Wed, Jun 15, 2022 at 03:51:51PM +0100, Joao Martins wrote:
> On 6/13/22 07:35, Muchun Song wrote:
> > It is inconvenient to mention the feature of optimizing vmemmap pages associated
> > with HugeTLB pages when communicating with others since there is no specific or
> > abbreviated name for it when it is first introduced.  Let us give it a name HVO
> > (HugeTLB Vmemmap Optimization) from now on.
> > 
> 
> Just thought I would throw this suggestion, even though I am probably too late.
> 

Not too late, we are still discussing the name.

> I find the term "vmemmap deduplication" more self-explanatory (at least for me)
> to refer to your technique, and similarly s/optimize/dedup. Or vmemmap tail page
> deduplication (too verbose maybe) because really that's what this optimization is all
> about. OTOH it would slightly deviate from what may be established now
> in hugetlb code.
>

Well, I have looked up this word "deduplication" which refers to a method of
eliminating a dataset’s redundant data.  At least I agree with you "deduplication"
is more expressive for my technique.  So I am thinking of renaming "HVO" to "HVD (
HugeTLB Vmemmap Deduplication)".  In this series (patch 6), I have renamed
hugetlb_vmemmap_alloc/free to hugetlb_vmemmap_optimize/restore.  I am also
thinking of replacing those with:

  hugetlb_vmemmap_deduplicate vs hugetlb_vmemmap_duplicate.

Many other places in hugetlb_vmemmap.c use the word "optimize"; maybe most of them do
not need to be changed since "deduplication" is also an __optimization__ technique.

Hi Mike and David:

What is your opinion on this? I want to hear some thoughts from you.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-16  3:28     ` Muchun Song
@ 2022-06-16 22:27       ` Mike Kravetz
  2022-06-17  7:49         ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Kravetz @ 2022-06-16 22:27 UTC (permalink / raw)
  To: Muchun Song
  Cc: Joao Martins, david, linux-mm, linux-kernel, linux-doc, akpm, corbet

On 06/16/22 11:28, Muchun Song wrote:
> On Wed, Jun 15, 2022 at 03:51:51PM +0100, Joao Martins wrote:
> > On 6/13/22 07:35, Muchun Song wrote:
> > > It is inconvenient to mention the feature of optimizing vmemmap pages associated
> > > with HugeTLB pages when communicating with others since there is no specific or
> > > abbreviated name for it when it is first introduced.  Let us give it a name HVO
> > > (HugeTLB Vmemmap Optimization) from now on.
> > > 
> > 
> > Just thought I would throw this suggestion, even though I am probably too late.
> > 
> 
> Not too late, we are still discussing the name.
> 
> > I find the term "vmemmap deduplication" more self-explanatory (at least for me)
> > to refer to your technique, and similarly s/optimize/dedup. Or vmemmap tail page
> > deduplication (too verbose maybe) because really that's what this optimization is all
> > about. OTOH it would slightly deviate from what may be established now
> > in hugetlb code.
> >
> 
> Well, I have looked up this word "deduplication" which refers to a method of
> eliminating a dataset’s redundant data.  At least I agree with you "deduplication"
> is more expressive for my technique.  So I am thinking of renaming "HVO" to "HVD (
> HugeTLB Vmemmap Deduplication)".  In this series (patch 6), I have renamed
> > hugetlb_vmemmap_alloc/free to hugetlb_vmemmap_optimize/restore.  I am also
> > thinking of replacing those with:
> > 
> >   hugetlb_vmemmap_deduplicate vs hugetlb_vmemmap_duplicate.
> > 
> > Many other places in hugetlb_vmemmap.c use the word "optimize"; maybe most of them do
> > not need to be changed since "deduplication" is also an __optimization__ technique.
> 
> Hi Mike and David:
> 
> > What is your opinion on this? I want to hear some thoughts from you.

I can understand Joao's preference for deduplication.  However, I can
also understand just using the term optimization.  IMO, neither is far
superior to the other.  It is mostly a matter of personal preference.

My preference would be to leave it as named in this series unless
someone has a strong preference for changing.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO
  2022-06-16 22:27       ` Mike Kravetz
@ 2022-06-17  7:49         ` Muchun Song
  0 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-17  7:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Joao Martins, david, linux-mm, linux-kernel, linux-doc, akpm, corbet

On Thu, Jun 16, 2022 at 03:27:40PM -0700, Mike Kravetz wrote:
> On 06/16/22 11:28, Muchun Song wrote:
> > On Wed, Jun 15, 2022 at 03:51:51PM +0100, Joao Martins wrote:
> > > On 6/13/22 07:35, Muchun Song wrote:
> > > > It is inconvenient to mention the feature of optimizing vmemmap pages associated
> > > > with HugeTLB pages when communicating with others since there is no specific or
> > > > abbreviated name for it when it is first introduced.  Let us give it a name HVO
> > > > (HugeTLB Vmemmap Optimization) from now on.
> > > > 
> > > 
> > > Just thought I would throw this suggestion, even though I am probably too late.
> > > 
> > 
> > Not too late, we are still discussing the name.
> > 
> > > I find the term "vmemmap deduplication" more self-explanatory (at least for me)
> > > to refer to your technique, and similarly s/optimize/dedup. Or vmemmap tail page
> > > deduplication (too verbose maybe) because really that's what this optimization is all
> > > about. OTOH it would slightly deviate from what may be established now
> > > in hugetlb code.
> > >
> > 
> > Well, I have looked up this word "deduplication" which refers to a method of
> > eliminating a dataset’s redundant data.  At least I agree with you "deduplication"
> > is more expressive for my technique.  So I am thinking of renaming "HVO" to "HVD (
> > HugeTLB Vmemmap Deduplication)".  In this series (patch 6), I have renamed
> > > hugetlb_vmemmap_alloc/free to hugetlb_vmemmap_optimize/restore.  I am also
> > > thinking of replacing those with:
> > > 
> > >   hugetlb_vmemmap_deduplicate vs hugetlb_vmemmap_duplicate.
> > > 
> > > Many other places in hugetlb_vmemmap.c use the word "optimize"; maybe most of them do
> > > not need to be changed since "deduplication" is also an __optimization__ technique.
> > 
> > Hi Mike and David:
> > 
> > > What is your opinion on this? I want to hear some thoughts from you.
> 
> I can understand Joao's preference for deduplication.  However, I can
> also understand just using the term optimization.  IMO, neither is far
> superior to the other.  It is mostly a matter of personal preference.
> 
> My preference would be to leave it as named in this series unless
> someone has a strong preference for changing.
>

All right. I'll keep the HVO name.

Thanks. 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()
  2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
  2022-06-13  8:04   ` Oscar Salvador
  2022-06-13 18:15   ` Mike Kravetz
@ 2022-06-20 16:57   ` Catalin Marinas
  2 siblings, 0 replies; 31+ messages in thread
From: Catalin Marinas @ 2022-06-20 16:57 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, david, akpm, corbet, linux-mm, linux-kernel,
	linux-doc, Will Deacon, Anshuman Khandual

On Mon, Jun 13, 2022 at 02:35:07PM +0800, Muchun Song wrote:
> The name hugetlb_optimize_vmemmap_enabled() a bit confusing as it tests
> two conditions (enabled and pages in use).  Instead of coming up to
> an appropriate name, we could just delete it.  There is already a
> discussion about deleting it in thread [1].
> 
> There is only one user of hugetlb_optimize_vmemmap_enabled() outside of
> hugetlb_vmemmap, that is flush_dcache_page() in arch/arm64/mm/flush.c.
> However, it does not need to call hugetlb_optimize_vmemmap_enabled()
> in flush_dcache_page() since HugeTLB pages are always fully mapped and
> only head page will be set PG_dcache_clean meaning only head page's flag
> may need to be cleared (see commit cf5a501d985b).  So it is easy to
> remove hugetlb_optimize_vmemmap_enabled().
> 
> Link: https://lore.kernel.org/all/c77c61c8-8a5a-87e8-db89-d04d8aaab4cc@oracle.com/ [1]
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-06-20 16:57 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-13  6:35 [PATCH 0/6] Simplify hugetlb vmemmap and improve its readability Muchun Song
2022-06-13  6:35 ` [PATCH 1/6] mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled() Muchun Song
2022-06-13  8:04   ` Oscar Salvador
2022-06-13 18:15   ` Mike Kravetz
2022-06-20 16:57   ` Catalin Marinas
2022-06-13  6:35 ` [PATCH 2/6] mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling Muchun Song
2022-06-13  8:10   ` Oscar Salvador
2022-06-13  8:24     ` Muchun Song
2022-06-13 18:28   ` Mike Kravetz
2022-06-13  6:35 ` [PATCH 3/6] mm: hugetlb_vmemmap: introduce the name HVO Muchun Song
2022-06-13  8:13   ` Oscar Salvador
2022-06-13 15:39   ` David Hildenbrand
2022-06-14  3:15     ` Muchun Song
2022-06-13 21:19   ` Mike Kravetz
2022-06-14  3:17     ` Muchun Song
2022-06-15 14:51   ` Joao Martins
2022-06-16  3:28     ` Muchun Song
2022-06-16 22:27       ` Mike Kravetz
2022-06-17  7:49         ` Muchun Song
2022-06-13  6:35 ` [PATCH 4/6] mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c Muchun Song
2022-06-13  8:15   ` Oscar Salvador
2022-06-13 21:34   ` Mike Kravetz
2022-06-13  6:35 ` [PATCH 5/6] mm: hugetlb_vmemmap: replace early_param() with core_param() Muchun Song
2022-06-13 21:43   ` Mike Kravetz
2022-06-13  6:35 ` [PATCH 6/6] mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability Muchun Song
2022-06-13  8:33   ` Oscar Salvador
2022-06-13  9:01     ` Muchun Song
2022-06-14  0:22       ` Mike Kravetz
2022-06-14  4:17         ` Muchun Song
2022-06-14 16:57           ` Mike Kravetz
2022-06-15 12:33             ` Muchun Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.