linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 00/21] Free some vmemmap pages of hugetlb page
@ 2020-11-08 14:10 Muchun Song
  2020-11-08 14:10 ` [PATCH v3 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
                   ` (21 more replies)
  0 siblings, 22 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Hi all,

This patch series frees some of the vmemmap pages (struct page structures)
associated with each HugeTLB page when it is preallocated, in order to save
memory.

Nowadays we track the status of physical page frames using `struct page`
structures arranged in one or more arrays. There is a one-to-one mapping
between each physical page frame and its corresponding `struct page`.

HugeTLB page support is built on top of the multiple page size support
provided by most modern architectures. For example, x86 CPUs normally
support 4K and 2M (1G if architecturally supported) page sizes.
Every HugeTLB page is described by more than one `struct page`: a 2M
HugeTLB page has 512 `struct page` (8 vmemmap pages) and a 1G HugeTLB
page has 262144 `struct page` (4096 vmemmap pages). But the HugeTLB
core only uses the first 4 `struct page` to store metadata associated
with each HugeTLB page. For the remaining `struct page`, usually only
the compound_head field is read, and it holds the same value in all of
them. If we can free some of that `struct page` memory back to the
buddy system, we can save a lot of memory.

When the system boots up, every 2M HugeTLB page has 512 `struct page`,
which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
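
For example, assuming a 4K base page size and sizeof(struct page) == 64:

   2M HugeTLB page:     512 * 64 bytes = 32K =    8 vmemmap pages
   1G HugeTLB page:  262144 * 64 bytes = 16M = 4096 vmemmap pages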

   hugetlbpage                  struct pages(8 pages)          page frame(8 pages)
  +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
  |           |                     |     0     | -------------> |     0     |
  |           |                     |     1     | -------------> |     1     |
  |           |                     |     2     | -------------> |     2     |
  |           |                     |     3     | -------------> |     3     |
  |           |                     |     4     | -------------> |     4     |
  |     2M    |                     |     5     | -------------> |     5     |
  |           |                     |     6     | -------------> |     6     |
  |           |                     |     7     | -------------> |     7     |
  |           |                     +-----------+                +-----------+
  |           |
  |           |
  +-----------+


When a HugeTLB page is preallocated, we can change the mapping from above to
below.

   hugetlbpage                  struct pages(8 pages)          page frame(8 pages)
  +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
  |           |                     |     0     | -------------> |     0     |
  |           |                     |     1     | -------------> |     1     |
  |           |                     |     2     | -------------> +-----------+
  |           |                     |     3     | -----------------^ ^ ^ ^ ^
  |           |                     |     4     | -------------------+ | | |
  |     2M    |                     |     5     | ---------------------+ | |
  |           |                     |     6     | -----------------------+ |
  |           |                     |     7     | -------------------------+
  |           |                     +-----------+
  |           |
  |           |
  +-----------+

For tail pages, the value of compound_head is the same. So we can reuse
the first page of tail page structs. We map the virtual addresses of the
remaining 6 pages of tail page structs to that first page, and then free
those 6 pages. Therefore, we need to reserve at least 2 pages as vmemmap
areas.
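
To make this concrete, below is a minimal sketch of the per-PTE remapping
step (the function name here is only for illustration; patch 09 implements
this as __free_huge_page_pte_vmemmap(), which additionally splits the
vmemmap PMD when needed and flushes the TLB for the whole remapped range):

  /*
   * Remap @nr_free vmemmap PTEs starting at @start to @reuse (read-only)
   * and collect the old backing pages on @free_pages so that they can be
   * freed to the buddy system afterwards.
   */
  static void remap_and_collect_tail_vmemmap(struct page *reuse, pte_t *ptep,
                                             unsigned long start,
                                             unsigned int nr_free,
                                             struct list_head *free_pages)
  {
          pte_t entry = mk_pte(reuse, PAGE_KERNEL_RO);
          unsigned long addr;
          unsigned long end = start + (nr_free << PAGE_SHIFT);

          for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
                  /* Remember the old backing page of this vmemmap address. */
                  list_add(&pte_page(*ptep)->lru, free_pages);
                  /* Point the vmemmap address at the reused page instead. */
                  set_pte_at(&init_mm, addr, ptep, entry);
          }
  }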

When a HugeTLB page is freed back to the buddy system, we have to allocate
6 pages for the vmemmap again and restore the previous mapping relationship.

If we use 1G HugeTLB pages, we can save 4095 pages for each of them. This is
a very substantial gain. On our server, we run some SPDK/QEMU applications
which use 1000GB of HugeTLB pages. With this feature enabled, we can save
~16GB (with 1G hugepages) or ~11GB (with 2MB hugepages) of memory.
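
The savings roughly work out as follows (assuming a 4K base page size):

   1G hugepages:  1000GB / 1G = 1000 HugeTLB pages, each freeing almost all
                  of its 4096 vmemmap pages (~16MB), i.e. ~16GB in total.
   2M hugepages:  1000GB / 2M = 512000 HugeTLB pages, each freeing 6 vmemmap
                  pages (24KB), i.e. ~11-12GB in total.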

Because the vmemmap page tables have to be reconstructed on the
freeing/allocating path, this adds some overhead. Here is an overhead analysis.

1) Allocating 10240 2MB hugetlb pages.

   a) With this patch series applied:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.166s
   user     0m0.000s
   sys      0m0.166s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [16K, 32K)          1868 |@@@@@@@@@@@                                         |
   [32K, 64K)            10 |                                                    |
   [64K, 128K)            2 |                                                    |

   b) Without this patch series:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.066s
   user     0m0.000s
   sys      0m0.066s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)             62 |                                                    |
   [16K, 32K)             2 |                                                    |

   Summary: allocation with this feature is about ~2x slower than before.

2) Freeing 10240 2MB hugetlb pages.

   a) With this patch series applied:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.004s
   user     0m0.000s
   sys      0m0.002s

   # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

   b) Without this patch series:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.077s
   user     0m0.001s
   sys      0m0.075s

   # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)            287 |@                                                   |
   [16K, 32K)             3 |                                                    |

   Summary: __free_hugepage is about ~2-4x slower than before. But according
            to the allocation test above, I think that it is also about ~2x
            slower than before.

            Why is the 'real' time of the patched kernel smaller than before?
            Because in this patch series the freeing of hugetlb pages is
            asynchronous (done through a kworker).

Although the overhead has increased, it is not incurred on every
allocation/freeing of a hugetlb page by applications; it is paid only once,
when we reserve hugetlb pages through /proc/sys/vm/nr_hugepages. Once the
reservation is successful, the subsequent allocating, freeing and using are
the same as before (not patched). So I think that the overhead is acceptable.

  changelog in v3:
  1. Rename some helper function names. Thanks Mike.
  2. Rework some code. Thanks Mike and Oscar.
  3. Remap the tail vmemmap page with PAGE_KERNEL_RO instead of
     PAGE_KERNEL. Thanks Matthew.
  4. Add some overhead analysis in the cover letter.
  5. Use the vmemmap pmd table lock instead of a hugetlb-specific global lock.

  changelog in v2:
  1. Fix: do not call dissolve_compound_page() in alloc_huge_page_vmemmap().
  2. Fix some typos and code style problems.
  3. Remove unused handle_vmemmap_fault().
  4. Merge some commits into one commit as suggested by Mike.

Muchun Song (21):
  mm/memory_hotplug: Move bootmem info registration API to
    bootmem_info.c
  mm/memory_hotplug: Move {get,put}_page_bootmem() to bootmem_info.c
  mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  mm/hugetlb: Introduce pgtable allocation/freeing helpers
  mm/bootmem_info: Introduce {free,prepare}_vmemmap_page()
  mm/bootmem_info: Combine bootmem info and type into page->freelist
  mm/vmemmap: Initialize page table lock for vmemmap
  mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  mm/hugetlb: Defer freeing of hugetlb pages
  mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb
    page
  mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper
  mm/hugetlb: Use PG_slab to indicate split pmd
  mm/hugetlb: Support freeing vmemmap pages of gigantic page
  mm/hugetlb: Add a BUILD_BUG_ON to check if struct page size is a power
    of two
  mm/hugetlb: Set the PageHWPoison to the raw error page
  mm/hugetlb: Flush work when dissolving hugetlb page
  mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  mm/hugetlb: Merge pte to huge pmd only for gigantic page
  mm/hugetlb: Gather discrete indexes of tail page
  mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct
    page

 Documentation/admin-guide/kernel-parameters.txt |   9 +
 Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
 arch/x86/include/asm/hugetlb.h                  |  17 +
 arch/x86/include/asm/pgtable_64_types.h         |   8 +
 arch/x86/mm/init_64.c                           |   7 +-
 fs/Kconfig                                      |  16 +
 include/linux/bootmem_info.h                    |  79 +++
 include/linux/hugetlb.h                         |  45 ++
 include/linux/hugetlb_cgroup.h                  |  15 +-
 include/linux/memory_hotplug.h                  |  27 -
 include/linux/mm.h                              |  49 ++
 mm/Makefile                                     |   1 +
 mm/bootmem_info.c                               | 124 ++++
 mm/hugetlb.c                                    | 806 +++++++++++++++++++++++-
 mm/memory_hotplug.c                             | 116 ----
 mm/sparse-vmemmap.c                             |  31 +
 mm/sparse.c                                     |   5 +-
 17 files changed, 1181 insertions(+), 177 deletions(-)
 create mode 100644 include/linux/bootmem_info.h
 create mode 100644 mm/bootmem_info.c

-- 
2.11.0


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-08 14:10 ` [PATCH v3 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() " Muchun Song
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Move the common bootmem info registration API into a separate
bootmem_info.c for use by later patches. This is just code movement without
any functional change.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/x86/mm/init_64.c          |  1 +
 include/linux/bootmem_info.h   | 27 ++++++++++++
 include/linux/memory_hotplug.h | 23 ----------
 mm/Makefile                    |  1 +
 mm/bootmem_info.c              | 99 ++++++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            | 91 +-------------------------------------
 6 files changed, 129 insertions(+), 113 deletions(-)
 create mode 100644 include/linux/bootmem_info.h
 create mode 100644 mm/bootmem_info.c

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..c7f7ad55b625 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/bootmem_info.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
new file mode 100644
index 000000000000..65bb9b23140f
--- /dev/null
+++ b/include/linux/bootmem_info.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_BOOTMEM_INFO_H
+#define __LINUX_BOOTMEM_INFO_H
+
+#include <linux/mmzone.h>
+
+/*
+ * Types for free bootmem stored in page->lru.next. These have to be in
+ * some random range in unsigned long space for debugging purposes.
+ */
+enum {
+	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
+	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
+	MIX_SECTION_INFO,
+	NODE_INFO,
+	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
+};
+
+#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
+void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
+#else
+static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+}
+#endif
+
+#endif /* __LINUX_BOOTMEM_INFO_H */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 51a877fec8da..19e5d067294c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,18 +33,6 @@ struct vmem_altmap;
 	___page;						   \
 })
 
-/*
- * Types for free bootmem stored in page->lru.next. These have to be in
- * some random range in unsigned long space for debugging purposes.
- */
-enum {
-	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
-	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
-	MIX_SECTION_INFO,
-	NODE_INFO,
-	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
-};
-
 /* Types for control the zone type of onlined and offlined memory */
 enum {
 	/* Offline the memory. */
@@ -209,13 +197,6 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
-extern void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
-#else
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-#endif
 extern void put_page_bootmem(struct page *page);
 extern void get_page_bootmem(unsigned long ingo, struct page *page,
 			     unsigned long type);
@@ -254,10 +235,6 @@ static inline int mhp_notimplemented(const char *func)
 	return -ENOSYS;
 }
 
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-
 static inline int try_online_node(int nid)
 {
 	return 0;
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..752111587c99 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)	+= kasan/
 obj-$(CONFIG_FAILSLAB) += failslab.o
+obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
new file mode 100644
index 000000000000..39fa8fc120bc
--- /dev/null
+++ b/mm/bootmem_info.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  linux/mm/bootmem_info.c
+ *
+ *  Copyright (C)
+ */
+#include <linux/mm.h>
+#include <linux/compiler.h>
+#include <linux/memblock.h>
+#include <linux/bootmem_info.h>
+#include <linux/memory_hotplug.h>
+
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+	struct mem_section_usage *usage;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	/* Get section's memmap address */
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	/*
+	 * Get page for the memmap's phys address
+	 * XXX: need more consideration for sparse_vmemmap...
+	 */
+	page = virt_to_page(memmap);
+	mapsize = sizeof(struct page) * PAGES_PER_SECTION;
+	mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
+
+	/* remember memmap's page */
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, SECTION_INFO);
+
+	usage = ms->usage;
+	page = virt_to_page(usage);
+
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+
+}
+#else /* CONFIG_SPARSEMEM_VMEMMAP */
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+	struct mem_section_usage *usage;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+	usage = ms->usage;
+	page = virt_to_page(usage);
+
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
+
+void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+	unsigned long i, pfn, end_pfn, nr_pages;
+	int node = pgdat->node_id;
+	struct page *page;
+
+	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	page = virt_to_page(pgdat);
+
+	for (i = 0; i < nr_pages; i++, page++)
+		get_page_bootmem(node, page, NODE_INFO);
+
+	pfn = pgdat->node_start_pfn;
+	end_pfn = pgdat_end_pfn(pgdat);
+
+	/* register section info */
+	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		/*
+		 * Some platforms can assign the same pfn to multiple nodes - on
+		 * node0 as well as nodeN.  To avoid registering a pfn against
+		 * multiple nodes we check that this pfn does not already
+		 * reside in some other nodes.
+		 */
+		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
+			register_page_bootmem_info_section(pfn);
+	}
+}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index baded53b9ff9..2da4ad071456 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -21,6 +21,7 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
+#include <linux/bootmem_info.h>
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
@@ -167,96 +168,6 @@ void put_page_bootmem(struct page *page)
 	}
 }
 
-#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void register_page_bootmem_info_section(unsigned long start_pfn)
-{
-	unsigned long mapsize, section_nr, i;
-	struct mem_section *ms;
-	struct page *page, *memmap;
-	struct mem_section_usage *usage;
-
-	section_nr = pfn_to_section_nr(start_pfn);
-	ms = __nr_to_section(section_nr);
-
-	/* Get section's memmap address */
-	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-
-	/*
-	 * Get page for the memmap's phys address
-	 * XXX: need more consideration for sparse_vmemmap...
-	 */
-	page = virt_to_page(memmap);
-	mapsize = sizeof(struct page) * PAGES_PER_SECTION;
-	mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
-
-	/* remember memmap's page */
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, SECTION_INFO);
-
-	usage = ms->usage;
-	page = virt_to_page(usage);
-
-	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
-
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
-
-}
-#else /* CONFIG_SPARSEMEM_VMEMMAP */
-static void register_page_bootmem_info_section(unsigned long start_pfn)
-{
-	unsigned long mapsize, section_nr, i;
-	struct mem_section *ms;
-	struct page *page, *memmap;
-	struct mem_section_usage *usage;
-
-	section_nr = pfn_to_section_nr(start_pfn);
-	ms = __nr_to_section(section_nr);
-
-	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-
-	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
-
-	usage = ms->usage;
-	page = virt_to_page(usage);
-
-	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
-
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
-}
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
-
-void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-	unsigned long i, pfn, end_pfn, nr_pages;
-	int node = pgdat->node_id;
-	struct page *page;
-
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
-	page = virt_to_page(pgdat);
-
-	for (i = 0; i < nr_pages; i++, page++)
-		get_page_bootmem(node, page, NODE_INFO);
-
-	pfn = pgdat->node_start_pfn;
-	end_pfn = pgdat_end_pfn(pgdat);
-
-	/* register section info */
-	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		/*
-		 * Some platforms can assign the same pfn to multiple nodes - on
-		 * node0 as well as nodeN.  To avoid registering a pfn against
-		 * multiple nodes we check that this pfn does not already
-		 * reside in some other nodes.
-		 */
-		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
-			register_page_bootmem_info_section(pfn);
-	}
-}
-#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
-
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() to bootmem_info.c
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
  2020-11-08 14:10 ` [PATCH v3 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-08 14:10 ` [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In a later patch, we will use {get,put}_page_bootmem() to initialize pages
for vmemmap or to free vmemmap pages to the buddy system. So move them out
of CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any
functional change.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/x86/mm/init_64.c          |  2 +-
 include/linux/bootmem_info.h   | 13 +++++++++++++
 include/linux/memory_hotplug.h |  4 ----
 mm/bootmem_info.c              | 26 ++++++++++++++++++++++++++
 mm/memory_hotplug.c            | 27 ---------------------------
 mm/sparse.c                    |  1 +
 6 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index c7f7ad55b625..0a45f062826e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1572,7 +1572,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return err;
 }
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HAVE_BOOTMEM_INFO_NODE)
+#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long nr_pages)
 {
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 65bb9b23140f..4ed6dee1adc9 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -18,10 +18,23 @@ enum {
 
 #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
+
+void get_page_bootmem(unsigned long info, struct page *page,
+		      unsigned long type);
+void put_page_bootmem(struct page *page);
 #else
 static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
 }
+
+static inline void put_page_bootmem(struct page *page)
+{
+}
+
+static inline void get_page_bootmem(unsigned long info, struct page *page,
+				    unsigned long type)
+{
+}
 #endif
 
 #endif /* __LINUX_BOOTMEM_INFO_H */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 19e5d067294c..c9f3361fe84b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -197,10 +197,6 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-extern void put_page_bootmem(struct page *page);
-extern void get_page_bootmem(unsigned long ingo, struct page *page,
-			     unsigned long type);
-
 void get_online_mems(void);
 void put_online_mems(void);
 
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index 39fa8fc120bc..d276e96e487f 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -10,6 +10,32 @@
 #include <linux/bootmem_info.h>
 #include <linux/memory_hotplug.h>
 
+void get_page_bootmem(unsigned long info,  struct page *page,
+		      unsigned long type)
+{
+	page->freelist = (void *)type;
+	SetPagePrivate(page);
+	set_page_private(page, info);
+	page_ref_inc(page);
+}
+
+void put_page_bootmem(struct page *page)
+{
+	unsigned long type;
+
+	type = (unsigned long) page->freelist;
+	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
+
+	if (page_ref_dec_return(page) == 1) {
+		page->freelist = NULL;
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		INIT_LIST_HEAD(&page->lru);
+		free_reserved_page(page);
+	}
+}
+
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2da4ad071456..ae57eedc341f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -21,7 +21,6 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
-#include <linux/bootmem_info.h>
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
@@ -142,32 +141,6 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-void get_page_bootmem(unsigned long info,  struct page *page,
-		      unsigned long type)
-{
-	page->freelist = (void *)type;
-	SetPagePrivate(page);
-	set_page_private(page, info);
-	page_ref_inc(page);
-}
-
-void put_page_bootmem(struct page *page)
-{
-	unsigned long type;
-
-	type = (unsigned long) page->freelist;
-	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
-	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
-
-	if (page_ref_dec_return(page) == 1) {
-		page->freelist = NULL;
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		INIT_LIST_HEAD(&page->lru);
-		free_reserved_page(page);
-	}
-}
-
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
diff --git a/mm/sparse.c b/mm/sparse.c
index b25ad8e64839..a4138410d890 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/bootmem_info.h>
 
 #include "internal.h"
 #include <asm/dma.h>
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
  2020-11-08 14:10 ` [PATCH v3 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
  2020-11-08 14:10 ` [PATCH v3 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() " Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-09 13:52   ` Oscar Salvador
  2020-11-08 14:10 ` [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
whether to enable the feature of freeing unused vmemmap pages associated
with HugeTLB pages. For now, only x86 is supported.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/mm/init_64.c |  2 +-
 fs/Kconfig            | 16 ++++++++++++++++
 mm/bootmem_info.c     |  3 +--
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0a45f062826e..0435bee2e172 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
 
 static void __init register_page_bootmem_info(void)
 {
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
 	int i;
 
 	for_each_online_node(i)
diff --git a/fs/Kconfig b/fs/Kconfig
index 976e8b9033c4..21b8d39a9715 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -245,6 +245,22 @@ config HUGETLBFS
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
 
+config HUGETLB_PAGE_FREE_VMEMMAP
+	bool "Free unused vmemmap associated with HugeTLB pages"
+	default y
+	depends on X86
+	depends on HUGETLB_PAGE
+	depends on SPARSEMEM_VMEMMAP
+	depends on HAVE_BOOTMEM_INFO_NODE
+	help
+	  There are many struct page structures associated with each HugeTLB
+	  page. But we only use a few struct page structures. In this case,
+	  it wastes some memory. It is better to free the unused struct page
+	  structures to buddy system which can save some memory. For
+	  architectures that support it, say Y here.
+
+	  If unsure, say N.
+
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
 
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index d276e96e487f..fcab5a3f8cc0 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -10,8 +10,7 @@
 #include <linux/bootmem_info.h>
 #include <linux/memory_hotplug.h>
 
-void get_page_bootmem(unsigned long info,  struct page *page,
-		      unsigned long type)
+void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
 {
 	page->freelist = (void *)type;
 	SetPagePrivate(page);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (2 preceding siblings ...)
  2020-11-08 14:10 ` [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-09 16:48   ` Oscar Salvador
  2020-11-08 14:10 ` [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

If the size of a hugetlb page is 2MB, we need 512 struct page structures
(8 pages) to be associated with it. As far as I know, we only use the
first 4 struct page structures; this use of the first 4 struct page
structures comes from HUGETLB_CGROUP_MIN_ORDER.

For tail pages, the value of compound_head is the same. So we can reuse
the first page of tail page structs. We map the virtual addresses of the
remaining 6 pages of tail page structs to that first page, and then free
those 6 pages. Therefore, we need to reserve at least 2 pages as vmemmap
areas.

So we introduce a new nr_free_vmemmap_pages field in the hstate to
indicate how many vmemmap pages associated with a hugetlb page can be
freed to the buddy system.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5cc5f802dd4..eed3dd3bd626 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -492,6 +492,9 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	unsigned int nr_free_vmemmap_pages;
+#endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
 	struct cftype cgroup_files_dfl[7];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 81a41aa080a5..a0007902fafb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1292,6 +1292,42 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+/*
+ * There are 512 struct page structs(8 pages) associated with each 2MB
+ * hugetlb page. For tail pages, the value of compound_dtor is the same.
+ * So we can reuse first page of tail page structs. We map the virtual
+ * addresses of the remaining 6 pages of tail page structs to the first
+ * tail page struct, and then free these 6 pages. Therefore, we need to
+ * reserve at least 2 pages as vmemmap areas.
+ */
+#define RESERVE_VMEMMAP_NR	2U
+
+static void __init hugetlb_vmemmap_init(struct hstate *h)
+{
+	unsigned int order = huge_page_order(h);
+	unsigned int vmemmap_pages;
+
+	vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
+	/*
+	 * The head page and the first tail page not free to buddy system,
+	 * the others page will map to the first tail page. So there are
+	 * (@vmemmap_pages - RESERVE_VMEMMAP_NR) pages can be freed.
+	 */
+	if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR))
+		h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
+	else
+		h->nr_free_vmemmap_pages = 0;
+
+	pr_debug("HugeTLB: can free %d vmemmap pages for %s\n",
+		 h->nr_free_vmemmap_pages, h->name);
+}
+#else
+static inline void hugetlb_vmemmap_init(struct hstate *h)
+{
+}
+#endif
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -3285,6 +3321,8 @@ void __init hugetlb_add_hstate(unsigned int order)
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
 
+	hugetlb_vmemmap_init(h);
+
 	parsed_hstate = h;
 }
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (3 preceding siblings ...)
  2020-11-08 14:10 ` [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-09 17:21   ` Oscar Salvador
  2020-11-11  0:47   ` Mike Kravetz
  2020-11-08 14:10 ` [PATCH v3 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page() Muchun Song
                   ` (16 subsequent siblings)
  21 siblings, 2 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

On x86_64, vmemmap is always PMD mapped if the machine has hugepage
support and we have 2MB of contiguous, PMD-aligned pages. If we want to
free the unused vmemmap pages, we have to split the huge PMD first. So we
should pre-allocate page tables to split the PMD into PTEs.
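
As a rough worked example (not part of the patch, just the arithmetic): for
a 2MB HugeTLB page with 4K base pages, the vmemmap is 8 pages (32KB), which
fits within a single 2MB-mapped vmemmap region, so only one page table page
needs to be pre-allocated per HugeTLB page
(ALIGN(32KB, VMEMMAP_HPAGE_SIZE) >> VMEMMAP_HPAGE_SHIFT == 1).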

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |  10 +++++
 mm/hugetlb.c            | 111 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index eed3dd3bd626..d81c262418db 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -593,6 +593,16 @@ static inline unsigned int blocks_per_huge_page(struct hstate *h)
 
 #include <asm/hugetlb.h>
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+#ifndef VMEMMAP_HPAGE_SHIFT
+#define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
+#endif
+#define VMEMMAP_HPAGE_ORDER		(VMEMMAP_HPAGE_SHIFT - PAGE_SHIFT)
+#define VMEMMAP_HPAGE_NR		(1 << VMEMMAP_HPAGE_ORDER)
+#define VMEMMAP_HPAGE_SIZE		((1UL) << VMEMMAP_HPAGE_SHIFT)
+#define VMEMMAP_HPAGE_MASK		(~(VMEMMAP_HPAGE_SIZE - 1))
+#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
+
 #ifndef is_hugepage_only_range
 static inline int is_hugepage_only_range(struct mm_struct *mm,
 					unsigned long addr, unsigned long len)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a0007902fafb..5c7be2ee7e15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1303,6 +1303,108 @@ static inline void destroy_compound_gigantic_page(struct page *page,
  */
 #define RESERVE_VMEMMAP_NR	2U
 
+#define page_huge_pte(page)	((page)->pmd_huge_pte)
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return h->nr_free_vmemmap_pages;
+}
+
+static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
+}
+
+static inline unsigned long vmemmap_pages_size_per_hpage(struct hstate *h)
+{
+	return (unsigned long)vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
+}
+
+static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
+{
+	unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
+
+	/*
+	 * No need pre-allocate page tabels when there is no vmemmap pages
+	 * to free.
+	 */
+	if (!free_vmemmap_pages_per_hpage(h))
+		return 0;
+
+	return ALIGN(vmemmap_size, VMEMMAP_HPAGE_SIZE) >> VMEMMAP_HPAGE_SHIFT;
+}
+
+static inline void vmemmap_pgtable_init(struct page *page)
+{
+	page_huge_pte(page) = NULL;
+}
+
+static void vmemmap_pgtable_deposit(struct page *page, pgtable_t pgtable)
+{
+	/* FIFO */
+	if (!page_huge_pte(page))
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &page_huge_pte(page)->lru);
+	page_huge_pte(page) = pgtable;
+}
+
+static pgtable_t vmemmap_pgtable_withdraw(struct page *page)
+{
+	pgtable_t pgtable;
+
+	/* FIFO */
+	pgtable = page_huge_pte(page);
+	page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
+						       struct page, lru);
+	if (page_huge_pte(page))
+		list_del(&pgtable->lru);
+	return pgtable;
+}
+
+static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
+{
+	int i;
+	pgtable_t pgtable;
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	if (!nr)
+		return 0;
+
+	vmemmap_pgtable_init(page);
+
+	for (i = 0; i < nr; i++) {
+		pte_t *pte_p;
+
+		pte_p = pte_alloc_one_kernel(&init_mm);
+		if (!pte_p)
+			goto out;
+		vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
+	}
+
+	return 0;
+out:
+	while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
+		pte_free_kernel(&init_mm, page_to_virt(pgtable));
+	return -ENOMEM;
+}
+
+static void vmemmap_pgtable_free(struct hstate *h, struct page *page)
+{
+	pgtable_t pgtable;
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	if (!nr)
+		return;
+
+	pgtable = page_huge_pte(page);
+	if (!pgtable)
+		return;
+
+	while (nr-- && (pgtable = vmemmap_pgtable_withdraw(page)))
+		pte_free_kernel(&init_mm, page_to_virt(pgtable));
+}
+
 static void __init hugetlb_vmemmap_init(struct hstate *h)
 {
 	unsigned int order = huge_page_order(h);
@@ -1326,6 +1428,15 @@ static void __init hugetlb_vmemmap_init(struct hstate *h)
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
 }
+
+static inline int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
+{
+	return 0;
+}
+
+static inline void vmemmap_pgtable_free(struct hstate *h, struct page *page)
+{
+}
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page()
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (4 preceding siblings ...)
  2020-11-08 14:10 ` [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-08 14:10 ` [PATCH v3 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist Muchun Song
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In a later patch, we will use free_vmemmap_page() to free the unused
vmemmap pages, and prepare_vmemmap_page() to initialize a page so that it
can be used as a vmemmap page.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/bootmem_info.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 4ed6dee1adc9..ce9d8c97369d 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -3,6 +3,7 @@
 #define __LINUX_BOOTMEM_INFO_H
 
 #include <linux/mmzone.h>
+#include <linux/mm.h>
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
@@ -22,6 +23,30 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
 void get_page_bootmem(unsigned long info, struct page *page,
 		      unsigned long type);
 void put_page_bootmem(struct page *page);
+
+static inline void free_vmemmap_page(struct page *page)
+{
+	VM_WARN_ON(!PageReserved(page) || page_ref_count(page) != 2);
+
+	/* bootmem page has reserved flag in the reserve_bootmem_region */
+	if (PageReserved(page)) {
+		unsigned long magic = (unsigned long)page->freelist;
+
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
+			put_page_bootmem(page);
+		else
+			WARN_ON(1);
+	}
+}
+
+static inline void prepare_vmemmap_page(struct page *page)
+{
+	unsigned long section_nr = pfn_to_section_nr(page_to_pfn(page));
+
+	get_page_bootmem(section_nr, page, SECTION_INFO);
+	__SetPageReserved(page);
+	adjust_managed_page_count(page, -1);
+}
 #else
 static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (5 preceding siblings ...)
  2020-11-08 14:10 ` [PATCH v3 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page() Muchun Song
@ 2020-11-08 14:10 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap Muchun Song
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:10 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

page->private shares storage with page->ptl. In a later patch, we will
use page->ptl. So here we combine the bootmem info and type into
page->freelist so that we no longer need to use page->private.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/bootmem_info.h | 18 ++++++++++++++++--
 mm/bootmem_info.c            | 12 ++++++------
 mm/sparse.c                  |  4 ++--
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index ce9d8c97369d..b5786a8b412e 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -6,7 +6,7 @@
 #include <linux/mm.h>
 
 /*
- * Types for free bootmem stored in page->lru.next. These have to be in
+ * Types for free bootmem stored in page->freelist. These have to be in
  * some random range in unsigned long space for debugging purposes.
  */
 enum {
@@ -17,6 +17,20 @@ enum {
 	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
 };
 
+#define BOOTMEM_TYPE_BITS	4
+#define BOOTMEM_TYPE_MAX	((1UL << BOOTMEM_TYPE_BITS) - 1)
+#define BOOTMEM_INFO_MAX	(ULONG_MAX >> BOOTMEM_TYPE_BITS)
+
+static inline unsigned long page_bootmem_type(struct page *page)
+{
+	return (unsigned long)page->freelist & BOOTMEM_TYPE_MAX;
+}
+
+static inline unsigned long page_bootmem_info(struct page *page)
+{
+	return (unsigned long)page->freelist >> BOOTMEM_TYPE_BITS;
+}
+
 #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
 
@@ -30,7 +44,7 @@ static inline void free_vmemmap_page(struct page *page)
 
 	/* bootmem page has reserved flag in the reserve_bootmem_region */
 	if (PageReserved(page)) {
-		unsigned long magic = (unsigned long)page->freelist;
+		unsigned long magic = page_bootmem_type(page);
 
 		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
 			put_page_bootmem(page);
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index fcab5a3f8cc0..9baf163965fd 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -12,9 +12,9 @@
 
 void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
 {
-	page->freelist = (void *)type;
-	SetPagePrivate(page);
-	set_page_private(page, info);
+	BUG_ON(info > BOOTMEM_INFO_MAX);
+	BUG_ON(type > BOOTMEM_TYPE_MAX);
+	page->freelist = (void *)((info << BOOTMEM_TYPE_BITS) | type);
 	page_ref_inc(page);
 }
 
@@ -22,14 +22,12 @@ void put_page_bootmem(struct page *page)
 {
 	unsigned long type;
 
-	type = (unsigned long) page->freelist;
+	type = page_bootmem_type(page);
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
 	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
 
 	if (page_ref_dec_return(page) == 1) {
 		page->freelist = NULL;
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
 		free_reserved_page(page);
 	}
@@ -101,6 +99,8 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 	int node = pgdat->node_id;
 	struct page *page;
 
+	BUILD_BUG_ON(MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE > BOOTMEM_TYPE_MAX);
+
 	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
 	page = virt_to_page(pgdat);
 
diff --git a/mm/sparse.c b/mm/sparse.c
index a4138410d890..fca5fa38c2bc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -740,12 +740,12 @@ static void free_map_bootmem(struct page *memmap)
 		>> PAGE_SHIFT;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		magic = (unsigned long) page->freelist;
+		magic = page_bootmem_type(page);
 
 		BUG_ON(magic == NODE_INFO);
 
 		maps_section_nr = pfn_to_section_nr(page_to_pfn(page));
-		removing_section_nr = page_private(page);
+		removing_section_nr = page_bootmem_info(page);
 
 		/*
 		 * When this function is called, the removing section is
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (6 preceding siblings ...)
  2020-11-08 14:10 ` [PATCH v3 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-09 18:11   ` Oscar Salvador
  2020-11-08 14:11 ` [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In register_page_bootmem_memmap(), the slab allocator is not ready yet.
So with ALLOC_SPLIT_PTLOCKS we use init_mm.page_table_lock, otherwise we
use the per-page-table lock (page->ptl). In a later patch, we will use
the vmemmap page table lock to guard the splitting of the vmemmap huge
PMD.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/mm/init_64.c |  2 ++
 include/linux/mm.h    | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0435bee2e172..a010101bbe24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1610,6 +1610,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 		}
 		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
 
+		vmemmap_ptlock_init(pud_page(*pud));
+
 		if (!boot_cpu_has(X86_FEATURE_PSE)) {
 			next = (addr + PAGE_SIZE) & PAGE_MASK;
 			pmd = pmd_offset(pud, addr);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a12354e63e49..ce429614d1ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2169,6 +2169,26 @@ static inline bool ptlock_init(struct page *page)
 	return true;
 }
 
+#if ALLOC_SPLIT_PTLOCKS
+static inline void vmemmap_ptlock_init(struct page *page)
+{
+}
+#else
+static inline void vmemmap_ptlock_init(struct page *page)
+{
+	/*
+	 * prep_new_page() initialize page->private (and therefore page->ptl)
+	 * with 0. Make sure nobody took it in use in between.
+	 *
+	 * It can happen if arch try to use slab for page table allocation:
+	 * slab code uses page->{s_mem, counters}, which share storage with
+	 * page->ptl.
+	 */
+	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+	spin_lock_init(ptlock_ptr(page));
+}
+#endif
+
 #else	/* !USE_SPLIT_PTE_PTLOCKS */
 /*
  * We use mm->page_table_lock to guard all pagetable pages of the mm.
@@ -2180,6 +2200,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct page *page) { return true; }
 static inline void ptlock_free(struct page *page) {}
+static inline void vmemmap_ptlock_init(struct page *page) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline void pgtable_init(void)
@@ -2244,6 +2265,18 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return ptlock_ptr(pmd_to_page(pmd));
 }
 
+#if ALLOC_SPLIT_PTLOCKS
+static inline spinlock_t *vmemmap_pmd_lockptr(pmd_t *pmd)
+{
+	return &init_mm.page_table_lock;
+}
+#else
+static inline spinlock_t *vmemmap_pmd_lockptr(pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_to_page(pmd));
+}
+#endif
+
 static inline bool pmd_ptlock_init(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -2269,6 +2302,11 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return &mm->page_table_lock;
 }
 
+static inline spinlock_t *vmemmap_pmd_lockptr(pmd_t *pmd)
+{
+	return &init_mm.page_table_lock;
+}
+
 static inline bool pmd_ptlock_init(struct page *page) { return true; }
 static inline void pmd_ptlock_free(struct page *page) {}
 
@@ -2283,6 +2321,13 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 	return ptl;
 }
 
+static inline spinlock_t *vmemmap_pmd_lock(pmd_t *pmd)
+{
+	spinlock_t *ptl = vmemmap_pmd_lockptr(pmd);
+	spin_lock(ptl);
+	return ptl;
+}
+
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
 	if (!pmd_ptlock_init(page))
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (7 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-09 18:51   ` Oscar Salvador
  2020-11-08 14:11 ` [PATCH v3 10/21] mm/hugetlb: Defer freeing of hugetlb pages Muchun Song
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

When we allocate a hugetlb page from the buddy system, we should free the
unused vmemmap pages associated with it. We can do that in
prep_new_huge_page().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/include/asm/hugetlb.h          |   9 ++
 arch/x86/include/asm/pgtable_64_types.h |   8 ++
 include/linux/hugetlb.h                 |   8 ++
 include/linux/mm.h                      |   4 +
 mm/hugetlb.c                            | 166 ++++++++++++++++++++++++++++++++
 mm/sparse-vmemmap.c                     |  31 ++++++
 6 files changed, 226 insertions(+)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 1721b1aadeb1..c601fe042832 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -4,6 +4,15 @@
 
 #include <asm/page.h>
 #include <asm-generic/hugetlb.h>
+#include <asm/pgtable.h>
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+#define vmemmap_pmd_huge vmemmap_pmd_huge
+static inline bool vmemmap_pmd_huge(pmd_t *pmd)
+{
+	return pmd_large(*pmd);
+}
+#endif
 
 #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
 
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..bedbd2e7d06c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -139,6 +139,14 @@ extern unsigned int ptrs_per_p4d;
 # define VMEMMAP_START		__VMEMMAP_BASE_L4
 #endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
 
+/*
+ * VMEMMAP_SIZE - allows the whole linear region to be covered by
+ *                a struct page array.
+ */
+#define VMEMMAP_SIZE		(1UL << (__VIRTUAL_MASK_SHIFT - PAGE_SHIFT - \
+					 1 + ilog2(sizeof(struct page))))
+#define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
+
 #define VMALLOC_END		(VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
 
 #define MODULES_VADDR		(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d81c262418db..afb9b18771c4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -594,6 +594,14 @@ static inline unsigned int blocks_per_huge_page(struct hstate *h)
 #include <asm/hugetlb.h>
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+#ifndef vmemmap_pmd_huge
+#define vmemmap_pmd_huge vmemmap_pmd_huge
+static inline bool vmemmap_pmd_huge(pmd_t *pmd)
+{
+	return pmd_huge(*pmd);
+}
+#endif
+
 #ifndef VMEMMAP_HPAGE_SHIFT
 #define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce429614d1ab..480faca94c23 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3025,6 +3025,10 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+pmd_t *vmemmap_to_pmd(const void *page);
+#endif
+
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5c7be2ee7e15..27f0269aab70 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1293,6 +1293,8 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+#include <linux/bootmem_info.h>
+
 /*
  * There are 512 struct page structs(8 pages) associated with each 2MB
  * hugetlb page. For tail pages, the value of compound_dtor is the same.
@@ -1305,6 +1307,13 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 
 #define page_huge_pte(page)	((page)->pmd_huge_pte)
 
+#define vmemmap_hpage_addr_end(addr, end)				\
+({									\
+	unsigned long __boundary;					\
+	__boundary = ((addr) + VMEMMAP_HPAGE_SIZE) & VMEMMAP_HPAGE_MASK;\
+	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
+})
+
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return h->nr_free_vmemmap_pages;
@@ -1424,6 +1433,147 @@ static void __init hugetlb_vmemmap_init(struct hstate *h)
 	pr_debug("HugeTLB: can free %d vmemmap pages for %s\n",
 		 h->nr_free_vmemmap_pages, h->name);
 }
+
+static inline int freed_vmemmap_hpage(struct page *page)
+{
+	return atomic_read(&page->_mapcount) + 1;
+}
+
+static inline int freed_vmemmap_hpage_inc(struct page *page)
+{
+	return atomic_inc_return_relaxed(&page->_mapcount) + 1;
+}
+
+static inline int freed_vmemmap_hpage_dec(struct page *page)
+{
+	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
+}
+
+static inline void free_vmemmap_page_list(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		list_del(&page->lru);
+		free_vmemmap_page(page);
+	}
+}
+
+static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
+					 unsigned long start,
+					 unsigned int nr_free,
+					 struct list_head *free_pages)
+{
+	/* Make sure the tail pages are mapped read-only. */
+	pgprot_t pgprot = PAGE_KERNEL_RO;
+	pte_t entry = mk_pte(reuse, pgprot);
+	unsigned long addr;
+	unsigned long end = start + (nr_free << PAGE_SHIFT);
+
+	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
+		struct page *page;
+		pte_t old = *ptep;
+
+		VM_WARN_ON(!pte_present(old));
+		page = pte_page(old);
+		list_add(&page->lru, free_pages);
+
+		set_pte_at(&init_mm, addr, ptep, entry);
+	}
+}
+
+static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					 unsigned long addr,
+					 struct list_head *free_pages)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	addr = start;
+	do {
+		unsigned int nr_pages;
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse)
+			reuse = pte_page(ptep[-1]);
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		nr_pages = (next - addr) >> PAGE_SHIFT;
+		__free_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
+					     free_pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
+static void split_vmemmap_pmd(pmd_t *pmd, pte_t *pte_p, unsigned long addr)
+{
+	int i;
+	pgprot_t pgprot = PAGE_KERNEL;
+	struct mm_struct *mm = &init_mm;
+	struct page *page;
+	pmd_t old_pmd, _pmd;
+
+	old_pmd = READ_ONCE(*pmd);
+	page = pmd_page(old_pmd);
+	pmd_populate_kernel(mm, &_pmd, pte_p);
+
+	for (i = 0; i < VMEMMAP_HPAGE_NR; i++, addr += PAGE_SIZE) {
+		pte_t entry, *pte;
+
+		entry = mk_pte(page + i, pgprot);
+		pte = pte_offset_kernel(&_pmd, addr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, addr, pte, entry);
+	}
+
+	/* make pte visible before pmd */
+	smp_wmb();
+	pmd_populate_kernel(mm, pmd, pte_p);
+}
+
+static void split_vmemmap_huge_page(struct hstate *h, struct page *head,
+				    pmd_t *pmd)
+{
+	pgtable_t pgtable;
+	unsigned long start = (unsigned long)head & VMEMMAP_HPAGE_MASK;
+	unsigned long addr = start;
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	while (nr-- && (pgtable = vmemmap_pgtable_withdraw(head))) {
+		VM_BUG_ON(freed_vmemmap_hpage(pgtable));
+		split_vmemmap_pmd(pmd++, page_to_virt(pgtable), addr);
+		addr += VMEMMAP_HPAGE_SIZE;
+	}
+
+	flush_tlb_kernel_range(start, addr);
+}
+
+static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	LIST_HEAD(free_pages);
+
+	if (!free_vmemmap_pages_per_hpage(h))
+		return;
+
+	pmd = vmemmap_to_pmd(head);
+	ptl = vmemmap_pmd_lock(pmd);
+	if (vmemmap_pmd_huge(pmd)) {
+		VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));
+		split_vmemmap_huge_page(h, head, pmd);
+	}
+
+	__free_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages);
+	freed_vmemmap_hpage_inc(pmd_page(*pmd));
+	spin_unlock(ptl);
+
+	free_vmemmap_page_list(&free_pages);
+}
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
@@ -1437,6 +1587,10 @@ static inline int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
 static inline void vmemmap_pgtable_free(struct hstate *h, struct page *page)
 {
 }
+
+static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
@@ -1645,6 +1799,10 @@ void free_huge_page(struct page *page)
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
+	free_huge_page_vmemmap(h, page);
+	/* Must be called before the initialization of @page->lru */
+	vmemmap_pgtable_free(h, page);
+
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	set_hugetlb_cgroup(page, NULL);
@@ -1897,6 +2055,14 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 	if (!page)
 		return NULL;
 
+	if (vmemmap_pgtable_prealloc(h, page)) {
+		if (hstate_is_gigantic(h))
+			free_gigantic_page(page, huge_page_order(h));
+		else
+			put_page(page);
+		return NULL;
+	}
+
 	if (hstate_is_gigantic(h))
 		prep_compound_gigantic_page(page, huge_page_order(h));
 	prep_new_huge_page(h, page, page_to_nid(page));
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..4b35d1655a2f 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -263,3 +263,34 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 
 	return pfn_to_page(pfn);
 }
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+/*
+ * Walk a vmemmap address to the pmd it maps.
+ */
+pmd_t *vmemmap_to_pmd(const void *page)
+{
+	unsigned long addr = (unsigned long)page;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (addr < VMEMMAP_START || addr >= VMEMMAP_END)
+		return NULL;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+	pud = pud_offset(p4d, addr);
+
+	if (pud_none(*pud) || pud_bad(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, addr);
+
+	return pmd;
+}
+#endif
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 10/21] mm/hugetlb: Defer freeing of hugetlb pages
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (8 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In a subsequent patch, we will allocate the vmemmap pages when freeing
huge pages. But update_and_free_page() can be called from a non-task
context (and with hugetlb_lock held), so defer the actual freeing to a
workqueue to avoid having to allocate the vmemmap pages with GFP_ATOMIC.
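
A stripped-down sketch of the deferral pattern used below (the names are
simplified stand-ins, not the ones in the hunk): page->mapping is borrowed
as an llist_node while the page sits on a lockless list, and the worker is
only kicked when the list goes from empty to non-empty.

  static LLIST_HEAD(free_list);

  static void free_workfn(struct work_struct *work)
  {
          struct llist_node *node = llist_del_all(&free_list);

          while (node) {
                  struct page *page = container_of((struct address_space **)node,
                                                   struct page, mapping);

                  node = node->next;
                  page->mapping = NULL;   /* the field was only borrowed */
                  /* ... free the huge page; this may sleep ... */
                  cond_resched();
          }
  }
  static DECLARE_WORK(free_work, free_workfn);

  static void defer_free(struct page *page)
  {
          /* llist_add() returns true only if the list was empty before. */
          if (llist_add((struct llist_node *)&page->mapping, &free_list))
                  schedule_work(&free_work);
  }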

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 89 insertions(+), 12 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 27f0269aab70..ded7f0fbde35 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1220,7 +1220,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned int order)
+static void __free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
 	 * If the page isn't allocated using the cma allocator,
@@ -1287,11 +1287,14 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
 	return NULL;
 }
-static inline void free_gigantic_page(struct page *page, unsigned int order) { }
+static inline void __free_gigantic_page(struct page *page,
+					unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+static void __free_hugepage(struct hstate *h, struct page *page);
+
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 #include <linux/bootmem_info.h>
 
@@ -1574,6 +1577,64 @@ static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 
 	free_vmemmap_page_list(&free_pages);
 }
+
+/*
+ * As update_and_free_page() is be called from a non-task context(and hold
+ * hugetlb_lock), we can defer the actual freeing in a workqueue to prevent
+ * use GFP_ATOMIC to allocate a lot of vmemmap pages.
+ *
+ * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of
+ * pages to be freed and frees them one-by-one. As the page->mapping pointer
+ * is going to be cleared in update_hpage_vmemmap_workfn() anyway, it is
+ * reused as the llist_node structure of a lockless linked list of huge
+ * pages to be freed.
+ */
+static LLIST_HEAD(hpage_update_freelist);
+
+static void update_hpage_vmemmap_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_update_freelist);
+
+	while (node) {
+		page = container_of((struct address_space **)node,
+				     struct page, mapping);
+		node = node->next;
+		page->mapping = NULL;
+		__free_hugepage(page_hstate(page), page);
+
+		cond_resched();
+	}
+}
+static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
+
+static inline void __update_and_free_page(struct hstate *h, struct page *page)
+{
+	/* No need to allocate vmemmap pages */
+	if (!free_vmemmap_pages_per_hpage(h)) {
+		__free_hugepage(h, page);
+		return;
+	}
+
+	/*
+	 * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap
+	 * pages.
+	 *
+	 * Only call schedule_work() if hpage_update_freelist is previously
+	 * empty. Otherwise, schedule_work() had been called but the workfn
+	 * hasn't retrieved the list yet.
+	 */
+	if (llist_add((struct llist_node *)&page->mapping,
+		      &hpage_update_freelist))
+		schedule_work(&hpage_update_work);
+}
+
+static inline void free_gigantic_page(struct hstate *h, struct page *page)
+{
+	__free_gigantic_page(page, huge_page_order(h));
+}
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
@@ -1591,17 +1652,39 @@ static inline void vmemmap_pgtable_free(struct hstate *h, struct page *page)
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
+
+static inline void __update_and_free_page(struct hstate *h, struct page *page)
+{
+	__free_hugepage(h, page);
+}
+
+static inline void free_gigantic_page(struct hstate *h, struct page *page)
+{
+	/*
+	 * Temporarily drop the hugetlb_lock, because
+	 * we might block in __free_gigantic_page().
+	 */
+	spin_unlock(&hugetlb_lock);
+	__free_gigantic_page(page, huge_page_order(h));
+	spin_lock(&hugetlb_lock);
+}
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
-	int i;
-
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
+
+	__update_and_free_page(h, page);
+}
+
+static void __free_hugepage(struct hstate *h, struct page *page)
+{
+	int i;
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
@@ -1613,14 +1696,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
+		free_gigantic_page(h, page);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
@@ -2057,7 +2134,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 
 	if (vmemmap_pgtable_prealloc(h, page)) {
 		if (hstate_is_gigantic(h))
-			free_gigantic_page(page, huge_page_order(h));
+			free_gigantic_page(h, page);
 		else
 			put_page(page);
 		return NULL;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (9 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 10/21] mm/hugetlb: Defer freeing of hugetlb pages Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper Muchun Song
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

When we free a hugetlb page to the buddy allocator, we should allocate the
vmemmap pages associated with it. We can do that in __free_hugepage().
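
Note that this allocation sits on a path that cannot back out (the huge
page is already being torn down), which is presumably why the hunk below
uses a no-fail mask that may also dip into memory reserves:

  #define GFP_VMEMMAP_PAGE	(GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)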

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ded7f0fbde35..8295911fe76e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1307,6 +1307,8 @@ static void __free_hugepage(struct hstate *h, struct page *page);
  * reserve at least 2 pages as vmemmap areas.
  */
 #define RESERVE_VMEMMAP_NR	2U
+#define RESERVE_VMEMMAP_SIZE	(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
+#define GFP_VMEMMAP_PAGE	(GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
 
 #define page_huge_pte(page)	((page)->pmd_huge_pte)
 
@@ -1490,7 +1492,7 @@ static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
 					 struct list_head *free_pages)
 {
 	unsigned long next;
-	unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
+	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
 	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
 	struct page *reuse = NULL;
 
@@ -1578,6 +1580,106 @@ static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 	free_vmemmap_page_list(&free_pages);
 }
 
+static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
+					  unsigned long start,
+					  unsigned int nr_remap,
+					  struct list_head *remap_pages)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	void *from = (void *)page_private(reuse);
+	unsigned long addr, end = start + (nr_remap << PAGE_SHIFT);
+
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		void *to;
+		struct page *page;
+		pte_t entry, old = *ptep;
+
+		page = list_first_entry_or_null(remap_pages, struct page, lru);
+		list_del(&page->lru);
+		to = page_to_virt(page);
+		copy_page(to, from);
+
+		/*
+		 * Make sure that any data that writes to the @to is made
+		 * visible to the physical page.
+		 */
+		flush_kernel_vmap_range(to, PAGE_SIZE);
+
+		prepare_vmemmap_page(page);
+
+		entry = mk_pte(page, pgprot);
+		set_pte_at(&init_mm, addr, ptep++, entry);
+
+		VM_BUG_ON(!pte_present(old) || pte_page(old) != reuse);
+	}
+}
+
+static void __remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					  unsigned long addr,
+					  struct list_head *remap_pages)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	addr = start;
+	do {
+		unsigned int nr_pages;
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse) {
+			reuse = pte_page(ptep[-1]);
+			set_page_private(reuse, addr - PAGE_SIZE);
+		}
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		nr_pages = (next - addr) >> PAGE_SHIFT;
+		__remap_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
+					      remap_pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
+static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < free_vmemmap_pages_per_hpage(h); i++) {
+		struct page *page;
+
+		/* This should not fail */
+		page = alloc_page(GFP_VMEMMAP_PAGE);
+		list_add_tail(&page->lru, list);
+	}
+}
+
+static void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	LIST_HEAD(remap_pages);
+
+	if (!free_vmemmap_pages_per_hpage(h))
+		return;
+
+	alloc_vmemmap_pages(h, &remap_pages);
+
+	pmd = vmemmap_to_pmd(head);
+	ptl = vmemmap_pmd_lock(pmd);
+	__remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head,
+				      &remap_pages);
+	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
+		/*
+		 * Todo:
+		 * Merge pte to huge pmd if it has ever been split.
+		 */
+	}
+	spin_unlock(ptl);
+}
+
 /*
  * As update_and_free_page() is be called from a non-task context(and hold
  * hugetlb_lock), we can defer the actual freeing in a workqueue to prevent
@@ -1653,6 +1755,10 @@ static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
 
+static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
+
 static inline void __update_and_free_page(struct hstate *h, struct page *page)
 {
 	__free_hugepage(h, page);
@@ -1685,6 +1791,8 @@ static void __free_hugepage(struct hstate *h, struct page *page)
 {
 	int i;
 
+	alloc_huge_page_vmemmap(h, page);
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (10 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

__free_huge_page_pmd_vmemmap() and __remap_huge_page_pmd_vmemmap() are
almost identical. So introduce a remap_huge_page_pmd_vmemmap() helper to
simplify the code.
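
The refactoring follows the usual "walker plus callback" shape: one
function owns the address iteration and the TLB flush, and the per-range
behaviour is supplied as a function pointer. A minimal sketch of that
shape (illustrative names only, not the exact code below):

  typedef void (*range_fn)(pte_t *ptep, unsigned long addr,
                           unsigned int nr_pages, struct list_head *pages);

  static void walk_vmemmap_range(pmd_t *pmd, unsigned long start,
                                 unsigned long end, struct list_head *pages,
                                 range_fn fn)
  {
          unsigned long addr = start, next;

          do {
                  pte_t *ptep = pte_offset_kernel(pmd, addr);

                  next = pmd_addr_end(addr, end);
                  fn(ptep, addr, (next - addr) >> PAGE_SHIFT, pages);
          } while (pmd++, addr = next, addr != end);

          flush_tlb_kernel_range(start, end);
  }

The freeing and the remapping paths then differ only in the callback they
pass in.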

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 98 ++++++++++++++++++++++++------------------------------------
 1 file changed, 39 insertions(+), 59 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8295911fe76e..5d3806476212 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1454,6 +1454,41 @@ static inline int freed_vmemmap_hpage_dec(struct page *page)
 	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
 }
 
+typedef void (*remap_pte_fn)(struct page *reuse, pte_t *ptep,
+			     unsigned long start, unsigned int nr_pages,
+			     struct list_head *pages);
+
+static void remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					unsigned long addr,
+					struct list_head *pages,
+					remap_pte_fn remap_fn)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	flush_cache_vunmap(start, end);
+
+	addr = start;
+	do {
+		unsigned int nr_pages;
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse) {
+			reuse = pte_page(ptep[-1]);
+			set_page_private(reuse, addr - PAGE_SIZE);
+		}
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		nr_pages = (next - addr) >> PAGE_SHIFT;
+		remap_fn(reuse, ptep, addr, nr_pages, pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
 static inline void free_vmemmap_page_list(struct list_head *list)
 {
 	struct page *page, *next;
@@ -1487,33 +1522,6 @@ static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 	}
 }
 
-static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
-					 unsigned long addr,
-					 struct list_head *free_pages)
-{
-	unsigned long next;
-	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
-	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
-	struct page *reuse = NULL;
-
-	addr = start;
-	do {
-		unsigned int nr_pages;
-		pte_t *ptep;
-
-		ptep = pte_offset_kernel(pmd, addr);
-		if (!reuse)
-			reuse = pte_page(ptep[-1]);
-
-		next = vmemmap_hpage_addr_end(addr, end);
-		nr_pages = (next - addr) >> PAGE_SHIFT;
-		__free_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
-					     free_pages);
-	} while (pmd++, addr = next, addr != end);
-
-	flush_tlb_kernel_range(start, end);
-}
-
 static void split_vmemmap_pmd(pmd_t *pmd, pte_t *pte_p, unsigned long addr)
 {
 	int i;
@@ -1573,7 +1581,8 @@ static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 		split_vmemmap_huge_page(h, head, pmd);
 	}
 
-	__free_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages);
+	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages,
+				    __free_huge_page_pte_vmemmap);
 	freed_vmemmap_hpage_inc(pmd_page(*pmd));
 	spin_unlock(ptl);
 
@@ -1614,35 +1623,6 @@ static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 	}
 }
 
-static void __remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
-					  unsigned long addr,
-					  struct list_head *remap_pages)
-{
-	unsigned long next;
-	unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
-	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
-	struct page *reuse = NULL;
-
-	addr = start;
-	do {
-		unsigned int nr_pages;
-		pte_t *ptep;
-
-		ptep = pte_offset_kernel(pmd, addr);
-		if (!reuse) {
-			reuse = pte_page(ptep[-1]);
-			set_page_private(reuse, addr - PAGE_SIZE);
-		}
-
-		next = vmemmap_hpage_addr_end(addr, end);
-		nr_pages = (next - addr) >> PAGE_SHIFT;
-		__remap_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
-					      remap_pages);
-	} while (pmd++, addr = next, addr != end);
-
-	flush_tlb_kernel_range(start, end);
-}
-
 static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
 {
 	int i;
@@ -1669,8 +1649,8 @@ static void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 
 	pmd = vmemmap_to_pmd(head);
 	ptl = vmemmap_pmd_lock(pmd);
-	__remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head,
-				      &remap_pages);
+	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &remap_pages,
+				    __remap_huge_page_pte_vmemmap);
 	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
 		/*
 		 * Todo:
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 13/21] mm/hugetlb: Use PG_slab to indicate split pmd
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (11 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page Muchun Song
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

When we allocate a hugetlb page from the buddy allocator, we may need to
split a huge pmd into ptes. When we free the hugetlb page, we can merge
the ptes back into a pmd. So we need to record whether the pmd has been
split. Page tables are not allocated from the slab allocator, so we can
reuse PG_slab on the page table page to indicate that the pmd has been
split.
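
In short, the flag pairs up across the two paths roughly like this
(condensed from the hunks below and the previous patches, not new code):

  /* free path: the huge vmemmap mapping had to be split */
  if (vmemmap_pmd_huge(pmd)) {
          split_vmemmap_huge_page(h, head, pmd);
          __SetPageSlab(pmd_page(*pmd));          /* remember the split */
  }

  /* alloc path: last user gone and the mapping was split earlier */
  if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && PageSlab(pmd_page(*pmd))) {
          __ClearPageSlab(pmd_page(*pmd));
          /* ... merge the ptes back into a huge pmd (a later patch) ... */
  }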

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5d3806476212..9b1ac52d9fdd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1565,6 +1565,25 @@ static void split_vmemmap_huge_page(struct hstate *h, struct page *head,
 	flush_tlb_kernel_range(start, addr);
 }
 
+static inline bool pmd_split(pmd_t *pmd)
+{
+	return PageSlab(pmd_page(*pmd));
+}
+
+static inline void set_pmd_split(pmd_t *pmd)
+{
+	/*
+	 * We should not use slab for page table allocation. So we can set
+	 * PG_slab to indicate that the pmd has been split.
+	 */
+	__SetPageSlab(pmd_page(*pmd));
+}
+
+static inline void clear_pmd_split(pmd_t *pmd)
+{
+	__ClearPageSlab(pmd_page(*pmd));
+}
+
 static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 	pmd_t *pmd;
@@ -1579,6 +1598,7 @@ static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 	if (vmemmap_pmd_huge(pmd)) {
 		VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));
 		split_vmemmap_huge_page(h, head, pmd);
+		set_pmd_split(pmd);
 	}
 
 	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages,
@@ -1651,11 +1671,12 @@ static void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 	ptl = vmemmap_pmd_lock(pmd);
 	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &remap_pages,
 				    __remap_huge_page_pte_vmemmap);
-	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
+	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
 		/*
 		 * Todo:
 		 * Merge pte to huge pmd if it has ever been split.
 		 */
+		clear_pmd_split(pmd);
 	}
 	spin_unlock(ptl);
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (12 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 15/21] mm/hugetlb: Add a BUILD_BUG_ON to check if struct page size is a power of two Muchun Song
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Gigantic pages are allocated from bootmem. If we want to free their unused
vmemmap pages, we also need page tables for the remapping, so allocate
those page tables from bootmem as well.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 71 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index afb9b18771c4..f8ca4d251aa8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -506,6 +506,9 @@ struct hstate {
 struct huge_bootmem_page {
 	struct list_head list;
 	struct hstate *hstate;
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	pte_t *vmemmap_pte;
+#endif
 };
 
 struct page *alloc_huge_page(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b1ac52d9fdd..ec0d33d2c426 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1419,6 +1419,62 @@ static void vmemmap_pgtable_free(struct hstate *h, struct page *page)
 		pte_free_kernel(&init_mm, page_to_virt(pgtable));
 }
 
+static unsigned long __init gather_vmemmap_pgtable_prealloc(void)
+{
+	struct huge_bootmem_page *m, *tmp;
+	unsigned long nr_free = 0;
+
+	list_for_each_entry_safe(m, tmp, &huge_boot_pages, list) {
+		struct hstate *h = m->hstate;
+		unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+		unsigned int pgtable_size;
+
+		if (!nr)
+			continue;
+
+		pgtable_size = nr << PAGE_SHIFT;
+		m->vmemmap_pte = memblock_alloc_try_nid(pgtable_size,
+				PAGE_SIZE, 0, MEMBLOCK_ALLOC_ACCESSIBLE,
+				NUMA_NO_NODE);
+		if (!m->vmemmap_pte) {
+			nr_free++;
+			list_del(&m->list);
+			memblock_free_early(__pa(m), huge_page_size(h));
+		}
+	}
+
+	return nr_free;
+}
+
+static void __init gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+					       struct page *page)
+{
+	int i;
+	struct hstate *h = m->hstate;
+	unsigned long pte = (unsigned long)m->vmemmap_pte;
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	if (!nr)
+		return;
+
+	vmemmap_pgtable_init(page);
+
+	for (i = 0; i < nr; i++, pte += PAGE_SIZE) {
+		pgtable_t pgtable = virt_to_page(pte);
+
+		__ClearPageReserved(pgtable);
+		vmemmap_pgtable_deposit(page, pgtable);
+	}
+
+	/*
+	 * If we had gigantic hugepages allocated at boot time, we need
+	 * to restore the 'stolen' pages to totalram_pages in order to
+	 * fix confusing memory reports from free(1) and another
+	 * side-effects, like CommitLimit going negative.
+	 */
+	adjust_managed_page_count(page, nr);
+}
+
 static void __init hugetlb_vmemmap_init(struct hstate *h)
 {
 	unsigned int order = huge_page_order(h);
@@ -1752,6 +1808,16 @@ static inline void vmemmap_pgtable_free(struct hstate *h, struct page *page)
 {
 }
 
+static inline unsigned long gather_vmemmap_pgtable_prealloc(void)
+{
+	return 0;
+}
+
+static inline void gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+					       struct page *page)
+{
+}
+
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
@@ -3013,6 +3079,7 @@ static void __init gather_bootmem_prealloc(void)
 		WARN_ON(page_count(page) != 1);
 		prep_compound_huge_page(page, h->order);
 		WARN_ON(PageReserved(page));
+		gather_vmemmap_pgtable_init(m, page);
 		prep_new_huge_page(h, page, page_to_nid(page));
 		put_page(page); /* free it into the hugepage allocator */
 
@@ -3065,6 +3132,10 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			break;
 		cond_resched();
 	}
+
+	if (hstate_is_gigantic(h))
+		i -= gather_vmemmap_pgtable_prealloc();
+
 	if (i < h->max_huge_pages) {
 		char buf[32];
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 15/21] mm/hugetlb: Add a BUILD_BUG_ON to check if struct page size is a power of two
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (13 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 16/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

We can only free the unused vmemmap pages to the buddy system when the
size of struct page is a power of two. Add a BUILD_BUG_ON to catch the
invalid case.
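
The restriction presumably exists because the remapping assumes the struct
pages of a hugetlb page fill whole vmemmap pages, which the power-of-two
size (64 bytes on x86_64 today) guarantees. The macro used below expands
to roughly the following compile-time check:

  BUILD_BUG_ON(sizeof(struct page) == 0 ||
               (sizeof(struct page) & (sizeof(struct page) - 1)) != 0);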

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec0d33d2c426..5aaa274b0684 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3764,6 +3764,10 @@ static int __init hugetlb_init(void)
 {
 	int i;
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	BUILD_BUG_ON_NOT_POWER_OF_2(sizeof(struct page));
+#endif
+
 	if (!hugepages_supported()) {
 		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
 			pr_warn("HugeTLB: huge pages not supported, ignoring associated command-line parameters\n");
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 16/21] mm/hugetlb: Set the PageHWPoison to the raw error page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (14 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 15/21] mm/hugetlb: Add a BUILD_BUG_ON to check if struct page size is a power of two Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 17/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Because we reuse the first tail page, setting PageHWPoison on a tail page
could end up setting it on a series of pages. So use head[4].private to
record the index of the real error page, and set PageHWPoison on the raw
error page later.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 50 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5aaa274b0684..00a6e97629aa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1794,6 +1794,29 @@ static inline void free_gigantic_page(struct hstate *h, struct page *page)
 {
 	__free_gigantic_page(page, huge_page_order(h));
 }
+
+static inline void subpage_hwpoison_deliver(struct page *head)
+{
+	struct page *page = head;
+
+	if (PageHWPoison(head))
+		page = head + page_private(head + 4);
+
+	/*
+	 * Move PageHWPoison flag from head page to the raw error page,
+	 * which makes any subpages rather than the error page reusable.
+	 */
+	if (page != head) {
+		SetPageHWPoison(page);
+		ClearPageHWPoison(head);
+	}
+}
+
+static inline void set_subpage_hwpoison(struct page *head, struct page *page)
+{
+	if (PageHWPoison(head))
+		set_page_private(head + 4, page - head);
+}
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
@@ -1841,6 +1864,22 @@ static inline void free_gigantic_page(struct hstate *h, struct page *page)
 	__free_gigantic_page(page, huge_page_order(h));
 	spin_lock(&hugetlb_lock);
 }
+
+static inline void subpage_hwpoison_deliver(struct page *head)
+{
+}
+
+static inline void set_subpage_hwpoison(struct page *head, struct page *page)
+{
+	/*
+	 * Move PageHWPoison flag from head page to the raw error page,
+	 * which makes any subpages rather than the error page reusable.
+	 */
+	if (PageHWPoison(head) && page != head) {
+		SetPageHWPoison(page);
+		ClearPageHWPoison(head);
+	}
+}
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
@@ -1859,6 +1898,7 @@ static void __free_hugepage(struct hstate *h, struct page *page)
 	int i;
 
 	alloc_huge_page_vmemmap(h, page);
+	subpage_hwpoison_deliver(page);
 
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -2416,14 +2456,8 @@ int dissolve_free_huge_page(struct page *page)
 		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
-		/*
-		 * Move PageHWPoison flag from head page to the raw error page,
-		 * which makes any subpages rather than the error page reusable.
-		 */
-		if (PageHWPoison(head) && page != head) {
-			SetPageHWPoison(page);
-			ClearPageHWPoison(head);
-		}
+
+		set_subpage_hwpoison(head, page);
 		list_del(&head->lru);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 17/21] mm/hugetlb: Flush work when dissolving hugetlb page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (15 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 16/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 18/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

We should flush the work when dissolving a hugetlb page to make sure that
the hugetlb page has actually been freed to the buddy allocator.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 00a6e97629aa..4cd2f4a6366a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1795,6 +1795,11 @@ static inline void free_gigantic_page(struct hstate *h, struct page *page)
 	__free_gigantic_page(page, huge_page_order(h));
 }
 
+static inline void flush_free_huge_page_work(void)
+{
+	flush_work(&hpage_update_work);
+}
+
 static inline void subpage_hwpoison_deliver(struct page *head)
 {
 	struct page *page = head;
@@ -1865,6 +1870,10 @@ static inline void free_gigantic_page(struct hstate *h, struct page *page)
 	spin_lock(&hugetlb_lock);
 }
 
+static inline void flush_free_huge_page_work(void)
+{
+}
+
 static inline void subpage_hwpoison_deliver(struct page *head)
 {
 }
@@ -2439,6 +2448,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 int dissolve_free_huge_page(struct page *page)
 {
 	int rc = -EBUSY;
+	bool need_flush = false;
 
 	/* Not to disrupt normal path by vainly holding hugetlb_lock */
 	if (!PageHuge(page))
@@ -2463,10 +2473,19 @@ int dissolve_free_huge_page(struct page *page)
 		h->free_huge_pages_node[nid]--;
 		h->max_huge_pages--;
 		update_and_free_page(h, head);
+		need_flush = true;
 		rc = 0;
 	}
 out:
 	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * We should flush work before return to make sure that
+	 * the hugetlb page is freed to the buddy.
+	 */
+	if (need_flush)
+		flush_free_huge_page_work();
+
 	return rc;
 }
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 18/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (16 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 17/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 19/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Add a kernel parameter hugetlb_free_vmemmap that allows the freeing of
unused vmemmap pages associated with each hugetlb page to be disabled at
boot time.
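
For example, a kernel built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP=y can
keep the vmemmap of every HugeTLB page fully populated by booting with
something like (illustrative command line, the hugepage sizes are just an
example):

  hugetlb_free_vmemmap=off hugepagesz=2M hugepages=512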

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  9 +++++++++
 Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +++
 mm/hugetlb.c                                    | 23 +++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5debfe238027..ccf07293cb63 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1551,6 +1551,15 @@
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]
 
+	hugetlb_free_vmemmap=
+			[KNL] When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set,
+			this controls freeing unused vmemmap pages associated
+			with each HugeTLB page.
+			Format: { on (default) | off }
+
+			on:  enable the feature
+			off: disable the feature
+
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
 			Format: 0 | 1
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f7b1c7462991..7d6129ee97dd 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,6 +145,9 @@ default_hugepagesz
 
 	will all result in 256 2M huge pages being allocated.  Valid default
 	huge page size is architecture dependent.
+hugetlb_free_vmemmap
+	When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this disables freeing
+	unused vmemmap pages associated with each HugeTLB page.
 
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4cd2f4a6366a..7c97a1d30fd9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1319,6 +1319,8 @@ static void __free_hugepage(struct hstate *h, struct page *page);
 	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
 })
 
+static bool hugetlb_free_vmemmap_disabled __initdata;
+
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return h->nr_free_vmemmap_pages;
@@ -1480,6 +1482,13 @@ static void __init hugetlb_vmemmap_init(struct hstate *h)
 	unsigned int order = huge_page_order(h);
 	unsigned int vmemmap_pages;
 
+	if (hugetlb_free_vmemmap_disabled) {
+		h->nr_free_vmemmap_pages = 0;
+		pr_info("HugeTLB: disable free vmemmap pages for %s\n",
+			h->name);
+		return;
+	}
+
 	vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
 	/*
 	 * The head page and the first tail page not free to buddy system,
@@ -1822,6 +1831,20 @@ static inline void set_subpage_hwpoison(struct page *head, struct page *page)
 	if (PageHWPoison(head))
 		set_page_private(head + 4, page - head);
 }
+
+static int __init early_hugetlb_free_vmemmap_param(char *buf)
+{
+	if (!buf)
+		return -EINVAL;
+
+	if (!strcmp(buf, "off"))
+		hugetlb_free_vmemmap_disabled = true;
+	else if (strcmp(buf, "on"))
+		return -EINVAL;
+
+	return 0;
+}
+early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 19/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (17 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 18/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 20/21] mm/hugetlb: Gather discrete indexes of tail page Muchun Song
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Merge ptes back into a huge pmd if the pmd has ever been split. For now,
only support gigantic pages whose vmemmap size is an integer multiple of
PMD_SIZE, which is the simplest case to handle.
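
A condensed view of the merge step for one huge mapping, pieced together
from the hunks below (illustrative, locking and error handling trimmed):

  struct page *page = alloc_pages(GFP_VMEMMAP_PAGE & ~__GFP_NOFAIL,
                                  VMEMMAP_HPAGE_ORDER);
  if (page) {
          dissolve_compound_page(page, VMEMMAP_HPAGE_ORDER);
          memcpy(page_to_virt(page), (void *)addr, VMEMMAP_HPAGE_SIZE);
          flush_kernel_vmap_range(page_to_virt(page), VMEMMAP_HPAGE_SIZE);
          /* point every pte at the copy, then collapse them into one pmd */
          replace_huge_page_pmd_vmemmap(pmd, addr, page, &remap_pages);
          pte_free_kernel(&init_mm, merge_vmemmap_pte(pmd, addr));
          flush_tlb_kernel_range(addr, addr + VMEMMAP_HPAGE_SIZE);
  }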

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/include/asm/hugetlb.h |   8 +++
 include/linux/hugetlb.h        |   8 +++
 mm/hugetlb.c                   | 108 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index c601fe042832..1de1c519a84a 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -12,6 +12,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
 {
 	return pmd_large(*pmd);
 }
+
+#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
+static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
+{
+	pte_t entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL_LARGE);
+
+	return __pmd(pte_val(entry));
+}
 #endif
 
 #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f8ca4d251aa8..32abfb420731 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -605,6 +605,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
 }
 #endif
 
+#ifndef vmemmap_pmd_mkhuge
+#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
+static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
+{
+	return pmd_mkhuge(mk_pmd(page, PAGE_KERNEL));
+}
+#endif
+
 #ifndef VMEMMAP_HPAGE_SHIFT
 #define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c97a1d30fd9..52e56c3a9b72 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1708,6 +1708,63 @@ static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 	}
 }
 
+static void __replace_huge_page_pte_vmemmap(pte_t *ptep, unsigned long start,
+					    unsigned int nr, struct page *huge,
+					    struct list_head *free_pages)
+{
+	unsigned long addr;
+	unsigned long end = start + (nr << PAGE_SHIFT);
+	pgprot_t pgprot = PAGE_KERNEL;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
+		struct page *page;
+		pte_t old = *ptep;
+		pte_t entry;
+
+		prepare_vmemmap_page(huge);
+
+		entry = mk_pte(huge++, pgprot);
+		VM_WARN_ON(!pte_present(old));
+		page = pte_page(old);
+		list_add(&page->lru, free_pages);
+
+		set_pte_at(&init_mm, addr, ptep, entry);
+	}
+}
+
+static void replace_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
+					  struct page *huge,
+					  struct list_head *free_pages)
+{
+	unsigned long end = start + VMEMMAP_HPAGE_SIZE;
+
+	flush_cache_vunmap(start, end);
+	__replace_huge_page_pte_vmemmap(pte_offset_kernel(pmd, start), start,
+					VMEMMAP_HPAGE_NR, huge, free_pages);
+	flush_tlb_kernel_range(start, end);
+}
+
+static pte_t *merge_vmemmap_pte(pmd_t *pmdp, unsigned long addr)
+{
+	pte_t *pte;
+	struct page *page;
+
+	pte = pte_offset_kernel(pmdp, addr);
+	page = pte_page(*pte);
+	set_pmd(pmdp, vmemmap_pmd_mkhuge(page));
+
+	return pte;
+}
+
+static void merge_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
+					struct page *huge,
+					struct list_head *free_pages)
+{
+	replace_huge_page_pmd_vmemmap(pmd, start, huge, free_pages);
+	pte_free_kernel(&init_mm, merge_vmemmap_pte(pmd, start));
+	flush_tlb_kernel_range(start, start + VMEMMAP_HPAGE_SIZE);
+}
+
 static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
 {
 	int i;
@@ -1721,6 +1778,15 @@ static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
 	}
 }
 
+static inline void dissolve_compound_page(struct page *page, unsigned int order)
+{
+	int i;
+	unsigned int nr_pages = 1 << order;
+
+	for (i = 1; i < nr_pages; i++)
+		set_page_refcounted(page + i);
+}
+
 static void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 	pmd_t *pmd;
@@ -1738,10 +1804,48 @@ static void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 				    __remap_huge_page_pte_vmemmap);
 	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
 		/*
-		 * Todo:
-		 * Merge pte to huge pmd if it has ever been split.
+		 * Merge pte to huge pmd if it has ever been split. Now only
+		 * support gigantic page which's vmemmap pages size is an
+		 * integer multiple of PMD_SIZE. This is the simplest case
+		 * to handle.
 		 */
 		clear_pmd_split(pmd);
+
+		if (IS_ALIGNED(vmemmap_pages_per_hpage(h), VMEMMAP_HPAGE_NR)) {
+			unsigned long addr = (unsigned long)head;
+			unsigned long end = addr +
+					    vmemmap_pages_size_per_hpage(h);
+
+			spin_unlock(ptl);
+
+			for (; addr < end; addr += VMEMMAP_HPAGE_SIZE) {
+				void *to;
+				struct page *page;
+
+				page = alloc_pages(GFP_VMEMMAP_PAGE & ~__GFP_NOFAIL,
+						   VMEMMAP_HPAGE_ORDER);
+				if (!page)
+					goto out;
+
+				dissolve_compound_page(page,
+						       VMEMMAP_HPAGE_ORDER);
+				to = page_to_virt(page);
+				memcpy(to, (void *)addr, VMEMMAP_HPAGE_SIZE);
+
+				/*
+				 * Make sure that any data that writes to the
+				 * @to is made visible to the physical page.
+				 */
+				flush_kernel_vmap_range(to, VMEMMAP_HPAGE_SIZE);
+
+				merge_huge_page_pmd_vmemmap(pmd++, addr, page,
+							    &remap_pages);
+			}
+
+out:
+			free_vmemmap_page_list(&remap_pages);
+			return;
+		}
 	}
 	spin_unlock(ptl);
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 20/21] mm/hugetlb: Gather discrete indexes of tail page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (18 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 19/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-08 14:11 ` [PATCH v3 21/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page Muchun Song
  2020-11-10 19:23 ` [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Mike Kravetz
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

For a hugetlb page, there is more metadata to store than fits in the head
struct page, so we have to use other tail struct pages to store it. In
order to avoid conflicts caused by subsequent use of more tail struct
pages, gather these discrete indexes of tail struct pages into one enum.
This way it will be easier to add a new tail page index later.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h        | 13 +++++++++++++
 include/linux/hugetlb_cgroup.h | 15 +++++++++------
 mm/hugetlb.c                   | 16 ++++++++--------
 3 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 32abfb420731..cb604b9dd649 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -28,6 +28,19 @@ typedef struct { unsigned long pd; } hugepd_t;
 #include <linux/shm.h>
 #include <asm/tlbflush.h>
 
+enum {
+	SUBPAGE_INDEX_ACTIVE = 1,	/* reuse page flags of PG_private */
+	SUBPAGE_INDEX_TEMPORARY,	/* reuse page->mapping */
+#ifdef CONFIG_CGROUP_HUGETLB
+	SUBPAGE_INDEX_CGROUP = SUBPAGE_INDEX_TEMPORARY,/* reuse page->private */
+	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
+#endif
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	SUBPAGE_INDEX_HWPOISON,		/* reuse page->private */
+#endif
+	NR_USED_SUBPAGE,
+};
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 2ad6e92f124a..3d3c1c49efe4 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -24,8 +24,9 @@ struct file_region;
 /*
  * Minimum page order trackable by hugetlb cgroup.
  * At least 4 pages are necessary for all the tracking information.
- * The second tail page (hpage[2]) is the fault usage cgroup.
- * The third tail page (hpage[3]) is the reservation usage cgroup.
+ * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault
+ * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD])
+ * is the reservation usage cgroup.
  */
 #define HUGETLB_CGROUP_MIN_ORDER	2
 
@@ -66,9 +67,9 @@ __hugetlb_cgroup_from_page(struct page *page, bool rsvd)
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
 	if (rsvd)
-		return (struct hugetlb_cgroup *)page[3].private;
+		return (void *)page_private(page + SUBPAGE_INDEX_CGROUP_RSVD);
 	else
-		return (struct hugetlb_cgroup *)page[2].private;
+		return (void *)page_private(page + SUBPAGE_INDEX_CGROUP);
 }
 
 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
@@ -90,9 +91,11 @@ static inline int __set_hugetlb_cgroup(struct page *page,
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return -1;
 	if (rsvd)
-		page[3].private = (unsigned long)h_cg;
+		set_page_private(page + SUBPAGE_INDEX_CGROUP_RSVD,
+				 (unsigned long)h_cg);
 	else
-		page[2].private = (unsigned long)h_cg;
+		set_page_private(page + SUBPAGE_INDEX_CGROUP,
+				 (unsigned long)h_cg);
 	return 0;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 52e56c3a9b72..1dd1a9cec008 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1918,7 +1918,7 @@ static inline void subpage_hwpoison_deliver(struct page *head)
 	struct page *page = head;
 
 	if (PageHWPoison(head))
-		page = head + page_private(head + 4);
+		page = head + page_private(head + SUBPAGE_INDEX_HWPOISON);
 
 	/*
 	 * Move PageHWPoison flag from head page to the raw error page,
@@ -1933,7 +1933,7 @@ static inline void subpage_hwpoison_deliver(struct page *head)
 static inline void set_subpage_hwpoison(struct page *head, struct page *page)
 {
 	if (PageHWPoison(head))
-		set_page_private(head + 4, page - head);
+		set_page_private(head + SUBPAGE_INDEX_HWPOISON, page - head);
 }
 
 static int __init early_hugetlb_free_vmemmap_param(char *buf)
@@ -2074,20 +2074,20 @@ struct hstate *size_to_hstate(unsigned long size)
 bool page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHuge(page), page);
-	return PageHead(page) && PagePrivate(&page[1]);
+	return PageHead(page) && PagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
 }
 
 /* never called for tail page */
 static void set_page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	SetPagePrivate(&page[1]);
+	SetPagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
 }
 
 static void clear_page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	ClearPagePrivate(&page[1]);
+	ClearPagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
 }
 
 /*
@@ -2099,17 +2099,17 @@ static inline bool PageHugeTemporary(struct page *page)
 	if (!PageHuge(page))
 		return false;
 
-	return (unsigned long)page[2].mapping == -1U;
+	return (unsigned long)page[SUBPAGE_INDEX_TEMPORARY].mapping == -1U;
 }
 
 static inline void SetPageHugeTemporary(struct page *page)
 {
-	page[2].mapping = (void *)-1U;
+	page[SUBPAGE_INDEX_TEMPORARY].mapping = (void *)-1U;
 }
 
 static inline void ClearPageHugeTemporary(struct page *page)
 {
-	page[2].mapping = NULL;
+	page[SUBPAGE_INDEX_TEMPORARY].mapping = NULL;
 }
 
 static void __free_huge_page(struct page *page)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 21/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (19 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 20/21] mm/hugetlb: Gather discrete indexes of tail page Muchun Song
@ 2020-11-08 14:11 ` Muchun Song
  2020-11-10 19:23 ` [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Mike Kravetz
  21 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-08 14:11 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Only `RESERVE_VMEMMAP_SIZE / sizeof(struct page)` struct pages can be used
when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled, so add a BUILD_BUG_ON to
catch invalid usage of the tail struct pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1dd1a9cec008..66b96705597a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3946,6 +3946,8 @@ static int __init hugetlb_init(void)
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 	BUILD_BUG_ON_NOT_POWER_OF_2(sizeof(struct page));
+	BUILD_BUG_ON(NR_USED_SUBPAGE >=
+		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
 #endif
 
 	if (!hugepages_supported()) {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-08 14:10 ` [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
@ 2020-11-09 13:52   ` Oscar Salvador
  2020-11-09 14:20     ` [External] " Muchun Song
  2020-11-10 19:31     ` Mike Kravetz
  0 siblings, 2 replies; 54+ messages in thread
From: Oscar Salvador @ 2020-11-09 13:52 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> whether to enable the feature of freeing unused vmemmap associated
> with HugeTLB pages. Now only support x86.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  arch/x86/mm/init_64.c |  2 +-
>  fs/Kconfig            | 16 ++++++++++++++++
>  mm/bootmem_info.c     |  3 +--
>  3 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0a45f062826e..0435bee2e172 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>  
>  static void __init register_page_bootmem_info(void)
>  {
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>  	int i;
>  
>  	for_each_online_node(i)
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 976e8b9033c4..21b8d39a9715 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -245,6 +245,22 @@ config HUGETLBFS
>  config HUGETLB_PAGE
>  	def_bool HUGETLBFS
>  
> +config HUGETLB_PAGE_FREE_VMEMMAP
> +	bool "Free unused vmemmap associated with HugeTLB pages"
> +	default y
> +	depends on X86
> +	depends on HUGETLB_PAGE
> +	depends on SPARSEMEM_VMEMMAP
> +	depends on HAVE_BOOTMEM_INFO_NODE
> +	help
> +	  There are many struct page structures associated with each HugeTLB
> +	  page. But we only use a few struct page structures. In this case,
> +	  it wastes some memory. It is better to free the unused struct page
> +	  structures to buddy system which can save some memory. For
> +	  architectures that support it, say Y here.
> +
> +	  If unsure, say N.

I am not sure the above is useful for someone who needs to decide
whether he needs/wants to enable this or not.
I think the above fits better in a Documentation part.

I suck at this, but what about the following, or something along those
lines? 

"
When using SPARSEMEM_VMEMMAP, the system can save up some memory
from pre-allocated HugeTLB pages when they are not used.
6 pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page.
When the pages are going to be used or freed up, the vmemmap
array representing that range needs to be remapped again and
the pages we discarded earlier need to be reallocated again.
Therefore, this is a trade-off between saving memory and
increasing time in allocation/free path.
"

It would also be great to point out that this might be a
trade-off between saving memory and increasing the cost
of certain operations on the allocation/free path.
That is why I mentioned it there.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-09 13:52   ` Oscar Salvador
@ 2020-11-09 14:20     ` Muchun Song
  2020-11-10 19:31     ` Mike Kravetz
  1 sibling, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-09 14:20 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Mon, Nov 9, 2020 at 9:52 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
> > The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> > whether to enable the feature of freeing unused vmemmap associated
> > with HugeTLB pages. Now only support x86.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  arch/x86/mm/init_64.c |  2 +-
> >  fs/Kconfig            | 16 ++++++++++++++++
> >  mm/bootmem_info.c     |  3 +--
> >  3 files changed, 18 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 0a45f062826e..0435bee2e172 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> >
> >  static void __init register_page_bootmem_info(void)
> >  {
> > -#ifdef CONFIG_NUMA
> > +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> >       int i;
> >
> >       for_each_online_node(i)
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 976e8b9033c4..21b8d39a9715 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -245,6 +245,22 @@ config HUGETLBFS
> >  config HUGETLB_PAGE
> >       def_bool HUGETLBFS
> >
> > +config HUGETLB_PAGE_FREE_VMEMMAP
> > +     bool "Free unused vmemmap associated with HugeTLB pages"
> > +     default y
> > +     depends on X86
> > +     depends on HUGETLB_PAGE
> > +     depends on SPARSEMEM_VMEMMAP
> > +     depends on HAVE_BOOTMEM_INFO_NODE
> > +     help
> > +       There are many struct page structures associated with each HugeTLB
> > +       page. But we only use a few struct page structures. In this case,
> > +       it wastes some memory. It is better to free the unused struct page
> > +       structures to buddy system which can save some memory. For
> > +       architectures that support it, say Y here.
> > +
> > +       If unsure, say N.
>
> I am not sure the above is useful for someone who needs to decide
> whether he needs/wants to enable this or not.
> I think the above fits better in a Documentation part.
>
> I suck at this, but what about the following, or something along those
> lines?
>
> "
> When using SPARSEMEM_VMEMMAP, the system can save up some memory
> from pre-allocated HugeTLB pages when they are not used.
> 6 pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page.
> When the pages are going to be used or freed up, the vmemmap
> array representing that range needs to be remapped again and
> the pages we discarded earlier need to be reallocated again.
> Therefore, this is a trade-off between saving memory and
> increasing time in allocation/free path.
> "

Will do. Thanks for your suggestions.

>
> It would also be great to point out that this might be a
> trade-off between saving memory and increasing the cost
> of certain operations on the allocation/free path.
> That is why I mentioned it there.

OK, I will add this to the Documentation part, thanks.

>
> --
> Oscar Salvador
> SUSE L3



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  2020-11-08 14:10 ` [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
@ 2020-11-09 16:48   ` Oscar Salvador
  2020-11-10  2:42     ` [External] " Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-09 16:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, Nov 08, 2020 at 10:10:56PM +0800, Muchun Song wrote:
> +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> +/*
> + * There are 512 struct page structs(8 pages) associated with each 2MB
> + * hugetlb page. For tail pages, the value of compound_dtor is the same.
I guess you meant "For tail pages, the value of compound_head ...", right?

> + * So we can reuse first page of tail page structs. We map the virtual
> + * addresses of the remaining 6 pages of tail page structs to the first
> + * tail page struct, and then free these 6 pages. Therefore, we need to
> + * reserve at least 2 pages as vmemmap areas.
> + */
> +#define RESERVE_VMEMMAP_NR	2U
> +
> +static void __init hugetlb_vmemmap_init(struct hstate *h)
> +{
> +	unsigned int order = huge_page_order(h);
> +	unsigned int vmemmap_pages;
> +
> +	vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
> +	/*
> +	 * The head page and the first tail page not free to buddy system,

"The head page and the first tail page are not to be freed to..." better?


> +	 * the others page will map to the first tail page. So there are
> +	 * (@vmemmap_pages - RESERVE_VMEMMAP_NR) pages can be freed.
						      ^^^
                                                      that

> +	else
> +		h->nr_free_vmemmap_pages = 0;

I would specify that this is not expected to happen.
(At least I could not come up with a real scenario unless the system is
corrupted)
So, I would drop a brief comment pointing out that it is only a safety
net.
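
Something like this (just to illustrate, the field name is the one from
your patch):

	else
		/* Not expected to happen, this is just a safety net. */
		h->nr_free_vmemmap_pages = 0;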


Unrelated to this patch but related in general, I am not sure about Mike but
would it be cleaner to move all the vmemmap functions to hugetlb_vmemmap.c?
hugetlb code is quite tricky, so I am not sure about stuffing more code
in there.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-08 14:10 ` [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
@ 2020-11-09 17:21   ` Oscar Salvador
  2020-11-10  3:49     ` [External] " Muchun Song
  2020-11-11  0:47   ` Mike Kravetz
  1 sibling, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-09 17:21 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, Nov 08, 2020 at 10:10:57PM +0800, Muchun Song wrote:
> +static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
> +{
> +	unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
> +
> +	/*
> +	 * No need pre-allocate page tabels when there is no vmemmap pages
> +	 * to free.
 s /tabels/tables/

> +static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> +{
> +	int i;
> +	pgtable_t pgtable;
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	if (!nr)
> +		return 0;
> +
> +	vmemmap_pgtable_init(page);
> +
> +	for (i = 0; i < nr; i++) {
> +		pte_t *pte_p;
> +
> +		pte_p = pte_alloc_one_kernel(&init_mm);
> +		if (!pte_p)
> +			goto out;
> +		vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
> +	}
> +
> +	return 0;
> +out:
> +	while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> +		pte_free_kernel(&init_mm, page_to_virt(pgtable));

	would not be enough to:

	while (pgtable = vmemmap_pgtable_withdraw(page))
		pte_free_kernel(&init_mm, page_to_virt(pgtable));

> +	return -ENOMEM;
> +}
> +
> +static void vmemmap_pgtable_free(struct hstate *h, struct page *page)
> +{
> +	pgtable_t pgtable;
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	if (!nr)
> +		return;

We can get rid of "nr" and its check and keep only the check below, right?
AFAICS, they go together, e.g: if page_huge_pte does not return null,
it means that we preallocated a pagetable, and viceversa.


> +
> +	pgtable = page_huge_pte(page);
> +	if (!pgtable)
> +		return;
> +
> +	while (nr-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> +		pte_free_kernel(&init_mm, page_to_virt(pgtable));

	Same as above, that "nr" can go?

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap
  2020-11-08 14:11 ` [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap Muchun Song
@ 2020-11-09 18:11   ` Oscar Salvador
  2020-11-10  5:17     ` [External] " Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-09 18:11 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, Nov 08, 2020 at 10:11:00PM +0800, Muchun Song wrote:
> In the register_page_bootmem_memmap, the slab allocator is not ready
> yet. So when ALLOC_SPLIT_PTLOCKS, we use init_mm.page_table_lock.
> otherwise we use per page table lock(page->ptl). In the later patch,
> we will use the vmemmap page table lock to guard the splitting of
> the vmemmap huge PMD.

I am not sure about this one.
Grabbing init_mm's pagetable lock for specific hugetlb operations does not
seem like a good idea, and we do not know how contended that one is.

I think a better fit would be to find another hook to initialize
page_table_lock at a later stage.
Anyway, we do not need it till we are going to perform an operation
on the range, right?

Unless I am missing something, this should be doable in hugetlb_init.

hugetlb_init is part from a init_call that gets called during do_initcalls.
At this time, slab is fully operative.

start_kernel
 kmem_cache_init_late
 kmem_cache_init_late
 ...
 arch_call_rest_init
  rest_init
   kernel_init_freeable
    do_basic_setup
     do_initcalls
      hugetlb_init

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-08 14:11 ` [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
@ 2020-11-09 18:51   ` Oscar Salvador
  2020-11-10  6:40     ` [External] " Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-09 18:51 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, Nov 08, 2020 at 10:11:01PM +0800, Muchun Song wrote:
> +static inline int freed_vmemmap_hpage(struct page *page)
> +{
> +	return atomic_read(&page->_mapcount) + 1;
> +}
> +
> +static inline int freed_vmemmap_hpage_inc(struct page *page)
> +{
> +	return atomic_inc_return_relaxed(&page->_mapcount) + 1;
> +}
> +
> +static inline int freed_vmemmap_hpage_dec(struct page *page)
> +{
> +	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
> +}

Are these relaxed any different than the normal ones on x86_64?
I got confused following the macros.

> +static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
> +					 unsigned long start,
> +					 unsigned int nr_free,
> +					 struct list_head *free_pages)
> +{
> +	/* Make the tail pages are mapped read-only. */
> +	pgprot_t pgprot = PAGE_KERNEL_RO;
> +	pte_t entry = mk_pte(reuse, pgprot);
> +	unsigned long addr;
> +	unsigned long end = start + (nr_free << PAGE_SHIFT);

See below.

> +static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
> +					 unsigned long addr,
> +					 struct list_head *free_pages)
> +{
> +	unsigned long next;
> +	unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
> +	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
> +	struct page *reuse = NULL;
> +
> +	addr = start;
> +	do {
> +		unsigned int nr_pages;
> +		pte_t *ptep;
> +
> +		ptep = pte_offset_kernel(pmd, addr);
> +		if (!reuse)
> +			reuse = pte_page(ptep[-1]);

Can we define a proper name for that instead of -1?

e.g: TAIL_PAGE_REUSE or something like that. 
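
For instance (only a sketch, the name is just a suggestion):

	#define TAIL_PAGE_REUSE	-1

		reuse = pte_page(ptep[TAIL_PAGE_REUSE]);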

> +
> +		next = vmemmap_hpage_addr_end(addr, end);
> +		nr_pages = (next - addr) >> PAGE_SHIFT;
> +		__free_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
> +					     free_pages);

Why not pass next instead of nr_pages? I think it makes more sense.
As a bonus we can kill the variable.

> +static void split_vmemmap_huge_page(struct hstate *h, struct page *head,
> +				    pmd_t *pmd)
> +{
> +	pgtable_t pgtable;
> +	unsigned long start = (unsigned long)head & VMEMMAP_HPAGE_MASK;
> +	unsigned long addr = start;
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	while (nr-- && (pgtable = vmemmap_pgtable_withdraw(head))) {

The same with previous patches, I would scrap "nr" and its use.

> +		VM_BUG_ON(freed_vmemmap_hpage(pgtable));

I guess here we want to check whether we already call free_huge_page_vmemmap
on this range?
For this to have happened, the locking should have failed, right?

> +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> +	pmd_t *pmd;
> +	spinlock_t *ptl;
> +	LIST_HEAD(free_pages);
> +
> +	if (!free_vmemmap_pages_per_hpage(h))
> +		return;
> +
> +	pmd = vmemmap_to_pmd(head);
> +	ptl = vmemmap_pmd_lock(pmd);
> +	if (vmemmap_pmd_huge(pmd)) {
> +		VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));

I think that checking for free_vmemmap_pages_per_hpage is enough.
In the end, pgtable_pages_to_prealloc_per_hpage uses free_vmemmap_pages_per_hpage.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  2020-11-09 16:48   ` Oscar Salvador
@ 2020-11-10  2:42     ` Muchun Song
  2020-11-10 19:38       ` Mike Kravetz
  0 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-10  2:42 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 12:48 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Sun, Nov 08, 2020 at 10:10:56PM +0800, Muchun Song wrote:
> > +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> > +/*
> > + * There are 512 struct page structs(8 pages) associated with each 2MB
> > + * hugetlb page. For tail pages, the value of compound_dtor is the same.
> I guess you meant "For tail pages, the value of compound_head ...", right?

Yeah, Thanks.

>
> > + * So we can reuse first page of tail page structs. We map the virtual
> > + * addresses of the remaining 6 pages of tail page structs to the first
> > + * tail page struct, and then free these 6 pages. Therefore, we need to
> > + * reserve at least 2 pages as vmemmap areas.
> > + */
> > +#define RESERVE_VMEMMAP_NR   2U
> > +
> > +static void __init hugetlb_vmemmap_init(struct hstate *h)
> > +{
> > +     unsigned int order = huge_page_order(h);
> > +     unsigned int vmemmap_pages;
> > +
> > +     vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
> > +     /*
> > +      * The head page and the first tail page not free to buddy system,
>
> "The head page and the first tail page are not to be freed to..." better?

Yeah, sorry for my poor English :).

>
>
> > +      * the others page will map to the first tail page. So there are
> > +      * (@vmemmap_pages - RESERVE_VMEMMAP_NR) pages can be freed.
>                                                       ^^^
>                                                       that
>
> > +     else
> > +             h->nr_free_vmemmap_pages = 0;
>
> I would specify that this is not expected to happen.
> (At least I could not come up with a real scenario unless the system is
> corrupted)
> So, I would drop a brief comment pointing out that it is only a safety
> net.

I will add a comment to point this out.

>
>
> Unrelated to this patch but related in general, I am not sure about Mike but
> would it be cleaner to move all the vmemmap functions to hugetlb_vmemmap.c?
> hugetlb code is quite tricky, so I am not sure about stuffing more code
> in there.
>

I also think that you are right, moving all the vmemmap functions to
hugetlb_vmemmap.c may make the code cleaner.

Hi Mike, what's your opinion?

Thanks.

> --
> Oscar Salvador
> SUSE L3



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-09 17:21   ` Oscar Salvador
@ 2020-11-10  3:49     ` Muchun Song
  2020-11-10  5:42       ` Oscar Salvador
  0 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-10  3:49 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 1:21 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Sun, Nov 08, 2020 at 10:10:57PM +0800, Muchun Song wrote:
> > +static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
> > +{
> > +     unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
> > +
> > +     /*
> > +      * No need pre-allocate page tabels when there is no vmemmap pages
> > +      * to free.
>  s /tabels/tables/

Thanks.

>
> > +static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> > +{
> > +     int i;
> > +     pgtable_t pgtable;
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     if (!nr)
> > +             return 0;
> > +
> > +     vmemmap_pgtable_init(page);
> > +
> > +     for (i = 0; i < nr; i++) {
> > +             pte_t *pte_p;
> > +
> > +             pte_p = pte_alloc_one_kernel(&init_mm);
> > +             if (!pte_p)
> > +                     goto out;
> > +             vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
> > +     }
> > +
> > +     return 0;
> > +out:
> > +     while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> > +             pte_free_kernel(&init_mm, page_to_virt(pgtable));
>
>         would not be enough to:
>
>         while (pgtable = vmemmap_pgtable_withdraw(page))
>                 pte_free_kernel(&init_mm, page_to_virt(pgtable));

The vmemmap_pgtable_withdraw can not return NULL. So we can not
drop the "i--".

>
> > +     return -ENOMEM;
> > +}
> > +
> > +static void vmemmap_pgtable_free(struct hstate *h, struct page *page)
> > +{
> > +     pgtable_t pgtable;
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     if (!nr)
> > +             return;
>
> We can get rid of "nr" and its check and keep only the check below, right?

Great, the check can go away.

> AFAICS, they go together, e.g: if page_huge_pte does not return null,
> it means that we preallocated a pagetable, and viceversa.
>
>
> > +
> > +     pgtable = page_huge_pte(page);
> > +     if (!pgtable)
> > +             return;
> > +
> > +     while (nr-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> > +             pte_free_kernel(&init_mm, page_to_virt(pgtable));
>
>         Same as above, that "nr" can go?

Here "nr" can not go. Because the vmemmap_pgtable_withdraw can
not return NULL.

Thanks.

>
> --
> Oscar Salvador
> SUSE L3



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap
  2020-11-09 18:11   ` Oscar Salvador
@ 2020-11-10  5:17     ` Muchun Song
  0 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-10  5:17 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 2:11 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Sun, Nov 08, 2020 at 10:11:00PM +0800, Muchun Song wrote:
> > In the register_page_bootmem_memmap, the slab allocator is not ready
> > yet. So when ALLOC_SPLIT_PTLOCKS, we use init_mm.page_table_lock.
> > otherwise we use per page table lock(page->ptl). In the later patch,
> > we will use the vmemmap page table lock to guard the splitting of
> > the vmemmap huge PMD.
>
> I am not sure about this one.
> Grabbing init_mm's pagetable lock for specific hugetlb operations does not
> seem like a good idea, and we do not know how contended that one is.

These APIs are used to guard the operations on the vmemmap page tables.
For now, they are only used for specific hugetlb operations, but maybe in
the future someone else will also want to modify the vmemmap page tables
and can use these APIs as well. Yeah, we do not know how contended
init_mm's pagetable lock is, so grabbing this one may not be a good idea.

>
> I think a better fit would be to find another hook to initialize
> page_table_lock at a later stage.
> Anyway, we do not need it till we are going to perform an operation
> on the range, right?

Yeah. You are right.

>
> Unless I am missing something, this should be doable in hugetlb_init.
>
> hugetlb_init is part from a init_call that gets called during do_initcalls.
> At this time, slab is fully operative.

If we initialize the page_table_lock in hugetlb_init, we need to
walk the vmemmap page tables again. But the vmemmap size is small,
so the overhead of this should also be small. And doing this in
hugetlb_init can make the code cleaner. Thanks very much.


>
> start_kernel
>  kmem_cache_init_late
>  kmem_cache_init_late
>  ...
>  arch_call_rest_init
>   rest_init
>    kernel_init_freeable
>     do_basic_setup
>      do_initcalls
>       hugetlb_init
>
> --
> Oscar Salvador
> SUSE L3



--
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-10  3:49     ` [External] " Muchun Song
@ 2020-11-10  5:42       ` Oscar Salvador
  2020-11-10  6:08         ` Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-10  5:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 11:49:27AM +0800, Muchun Song wrote:
> On Tue, Nov 10, 2020 at 1:21 AM Oscar Salvador <osalvador@suse.de> wrote:
> >
> > On Sun, Nov 08, 2020 at 10:10:57PM +0800, Muchun Song wrote:
> > > +static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
> > > +{
> > > +     unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
> > > +
> > > +     /*
> > > +      * No need pre-allocate page tabels when there is no vmemmap pages
> > > +      * to free.
> >  s /tabels/tables/
> 
> Thanks.
> 
> >
> > > +static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> > > +{
> > > +     int i;
> > > +     pgtable_t pgtable;
> > > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > > +
> > > +     if (!nr)
> > > +             return 0;
> > > +
> > > +     vmemmap_pgtable_init(page);
> > > +
> > > +     for (i = 0; i < nr; i++) {
> > > +             pte_t *pte_p;
> > > +
> > > +             pte_p = pte_alloc_one_kernel(&init_mm);
> > > +             if (!pte_p)
> > > +                     goto out;
> > > +             vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
> > > +     }
> > > +
> > > +     return 0;
> > > +out:
> > > +     while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> > > +             pte_free_kernel(&init_mm, page_to_virt(pgtable));
> >
> >         would not be enough to:
> >
> >         while (pgtable = vmemmap_pgtable_withdraw(page))
> >                 pte_free_kernel(&init_mm, page_to_virt(pgtable));
> 
> The vmemmap_pgtable_withdraw can not return NULL. So we can not
> drop the "i--".

Yeah, you are right, I managed to confuse myself.
But why not make it return null, something like:

static pgtable_t vmemmap_pgtable_withdraw(struct page *page)
{
	pgtable_t pgtable;

	/* FIFO */
	pgtable = page_huge_pte(page);
	page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
						       struct page, lru);
	if (page_huge_pte(page))
		list_del(&pgtable->lru);

	return page_huge_pte(page) ? pgtable : NULL;
}
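
With that, the cleanup loops in the callers would boil down to the sketch
I gave earlier:

	while ((pgtable = vmemmap_pgtable_withdraw(page)))
		pte_free_kernel(&init_mm, page_to_virt(pgtable));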

What do you think?


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-10  5:42       ` Oscar Salvador
@ 2020-11-10  6:08         ` Muchun Song
  2020-11-10  6:33           ` Oscar Salvador
  0 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-10  6:08 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 1:42 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Tue, Nov 10, 2020 at 11:49:27AM +0800, Muchun Song wrote:
> > On Tue, Nov 10, 2020 at 1:21 AM Oscar Salvador <osalvador@suse.de> wrote:
> > >
> > > On Sun, Nov 08, 2020 at 10:10:57PM +0800, Muchun Song wrote:
> > > > +static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
> > > > +{
> > > > +     unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
> > > > +
> > > > +     /*
> > > > +      * No need pre-allocate page tabels when there is no vmemmap pages
> > > > +      * to free.
> > >  s /tabels/tables/
> >
> > Thanks.
> >
> > >
> > > > +static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> > > > +{
> > > > +     int i;
> > > > +     pgtable_t pgtable;
> > > > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > > > +
> > > > +     if (!nr)
> > > > +             return 0;
> > > > +
> > > > +     vmemmap_pgtable_init(page);
> > > > +
> > > > +     for (i = 0; i < nr; i++) {
> > > > +             pte_t *pte_p;
> > > > +
> > > > +             pte_p = pte_alloc_one_kernel(&init_mm);
> > > > +             if (!pte_p)
> > > > +                     goto out;
> > > > +             vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
> > > > +     }
> > > > +
> > > > +     return 0;
> > > > +out:
> > > > +     while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> > > > +             pte_free_kernel(&init_mm, page_to_virt(pgtable));
> > >
> > >         would not be enough to:
> > >
> > >         while (pgtable = vmemmap_pgtable_withdraw(page))
> > >                 pte_free_kernel(&init_mm, page_to_virt(pgtable));
> >
> > The vmemmap_pgtable_withdraw can not return NULL. So we can not
> > drop the "i--".
>
> Yeah, you are right, I managed to confuse myself.
> But why not make it return null, something like:
>
> static pgtable_t vmemmap_pgtable_withdraw(struct page *page)
> {
>         pgtable_t pgtable;
>
>         /* FIFO */
>         pgtable = page_huge_pte(page);

The check should be added here.

           if (!pgtable)
                   return NULL;

Just like my previous v2 patch does. In this case, we can drop those
checks. What do you think?

>         page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
>                                                        struct page, lru);
>         if (page_huge_pte(page))
>                 list_del(&pgtable->lru);
>
>         return page_huge_pte(page) ? pgtable : NULL;
> }
>
> What do you think?
>
>
> --
> Oscar Salvador
> SUSE L3



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-10  6:08         ` Muchun Song
@ 2020-11-10  6:33           ` Oscar Salvador
  2020-11-10  7:10             ` Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-10  6:33 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 02:08:46PM +0800, Muchun Song wrote:
> The check should be added here.
> 
>            if (!pgtable)
>                    return NULL;
> 
> Just like my previous v2 patch does. In this case, we can drop those
> checks. What do you think?

It is too early for me, so bear with me.

page_huge_pte will only return NULL in case we did not get to preallocate
any pgtable right?

What I was talking about is that
> 
> >         page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
> >                                                        struct page, lru);

here we will get either a pgtable entry or NULL in case we already consumed
all entries from the list.
If that is the case, we can return NULL and let the caller know that we
are done.

Am I missing anything?


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-09 18:51   ` Oscar Salvador
@ 2020-11-10  6:40     ` Muchun Song
  2020-11-10  9:48       ` Oscar Salvador
  0 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-10  6:40 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 2:51 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Sun, Nov 08, 2020 at 10:11:01PM +0800, Muchun Song wrote:
> > +static inline int freed_vmemmap_hpage(struct page *page)
> > +{
> > +     return atomic_read(&page->_mapcount) + 1;
> > +}
> > +
> > +static inline int freed_vmemmap_hpage_inc(struct page *page)
> > +{
> > +     return atomic_inc_return_relaxed(&page->_mapcount) + 1;
> > +}
> > +
> > +static inline int freed_vmemmap_hpage_dec(struct page *page)
> > +{
> > +     return atomic_dec_return_relaxed(&page->_mapcount) + 1;
> > +}
>
> Are these relaxed any different than the normal ones on x86_64?
> I got confused following the macros.

A PTE table can contain the struct page structures of 64 2MB HugeTLB pages.
So I use freed_vmemmap_hpage to indicate how many HugeTLB pages in that
range have already had their vmemmap pages freed to the buddy system.

Once the vmemmap pages of a HugeTLB page are freed, we call
freed_vmemmap_hpage_inc; when freeing a HugeTLB page back to the buddy
system, we should call freed_vmemmap_hpage_dec.

If freed_vmemmap_hpage hits zero when freeing a HugeTLB page, we try to
merge the PTE table back into a huge PMD (for now, only gigantic pages
are supported). See

  [PATCH v3 19/21] mm/hugetlb: Merge pte to huge pmd only for gigantic
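
To make the life cycle a bit more concrete, the accounting is roughly like
this (only a simplified sketch of the idea, not the exact code of patch 19):

	/*
	 * pgtable is the split PTE table covering one vmemmap PMD range,
	 * i.e. the struct pages of up to 64 2MB HugeTLB pages.
	 */

	/* after the vmemmap pages of a HugeTLB page are freed to buddy */
	freed_vmemmap_hpage_inc(pgtable);

	/* when that HugeTLB page goes back to buddy and its vmemmap has
	 * been restored */
	if (freed_vmemmap_hpage_dec(pgtable) == 0) {
		/* no HugeTLB page in this range has freed vmemmap any
		 * more, so try to merge the PTEs back into a huge PMD */
	}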

Thanks.

>
> > +static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
> > +                                      unsigned long start,
> > +                                      unsigned int nr_free,
> > +                                      struct list_head *free_pages)
> > +{
> > +     /* Make the tail pages are mapped read-only. */
> > +     pgprot_t pgprot = PAGE_KERNEL_RO;
> > +     pte_t entry = mk_pte(reuse, pgprot);
> > +     unsigned long addr;
> > +     unsigned long end = start + (nr_free << PAGE_SHIFT);
>
> See below.
>
> > +static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
> > +                                      unsigned long addr,
> > +                                      struct list_head *free_pages)
> > +{
> > +     unsigned long next;
> > +     unsigned long start = addr + RESERVE_VMEMMAP_NR * PAGE_SIZE;
> > +     unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
> > +     struct page *reuse = NULL;
> > +
> > +     addr = start;
> > +     do {
> > +             unsigned int nr_pages;
> > +             pte_t *ptep;
> > +
> > +             ptep = pte_offset_kernel(pmd, addr);
> > +             if (!reuse)
> > +                     reuse = pte_page(ptep[-1]);
>
> Can we define a proper name for that instead of -1?
>
> e.g: TAIL_PAGE_REUSE or something like that.

OK, will do.

>
> > +
> > +             next = vmemmap_hpage_addr_end(addr, end);
> > +             nr_pages = (next - addr) >> PAGE_SHIFT;
> > +             __free_huge_page_pte_vmemmap(reuse, ptep, addr, nr_pages,
> > +                                          free_pages);
>
> Why not pass next instead of nr_pages? I think it makes more sense.
> As a bonus we can kill the variable.

Good catch. We can pass next instead of nr_pages. Thanks.


>
> > +static void split_vmemmap_huge_page(struct hstate *h, struct page *head,
> > +                                 pmd_t *pmd)
> > +{
> > +     pgtable_t pgtable;
> > +     unsigned long start = (unsigned long)head & VMEMMAP_HPAGE_MASK;
> > +     unsigned long addr = start;
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     while (nr-- && (pgtable = vmemmap_pgtable_withdraw(head))) {
>
> The same with previous patches, I would scrap "nr" and its use.
>
> > +             VM_BUG_ON(freed_vmemmap_hpage(pgtable));
>
> I guess here we want to check whether we already call free_huge_page_vmemmap
> on this range?
> For this to have happened, the locking should have failed, right?

Only the first HugeTLB page should split the PMD to PTE. The other 63
HugeTLB pages
do not need to split. Here I want to make sure we are the first.
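
I.e. roughly what the quoted code does (simplified):

	ptl = vmemmap_pmd_lock(pmd);
	if (vmemmap_pmd_huge(pmd)) {
		/*
		 * Still PMD-mapped, so we are the first of the (up to) 64
		 * HugeTLB pages sharing this vmemmap range: split it.
		 */
		split_vmemmap_huge_page(h, head, pmd);
	}
	/* otherwise another HugeTLB page already split it for us */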

>
> > +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > +     pmd_t *pmd;
> > +     spinlock_t *ptl;
> > +     LIST_HEAD(free_pages);
> > +
> > +     if (!free_vmemmap_pages_per_hpage(h))
> > +             return;
> > +
> > +     pmd = vmemmap_to_pmd(head);
> > +     ptl = vmemmap_pmd_lock(pmd);
> > +     if (vmemmap_pmd_huge(pmd)) {
> > +             VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));
>
> I think that checking for free_vmemmap_pages_per_hpage is enough.
> In the end, pgtable_pages_to_prealloc_per_hpage uses free_vmemmap_pages_per_hpage.

The free_vmemmap_pages_per_hpage is not enough. See the comments above.

Thanks.

>
>
> --
> Oscar Salvador
> SUSE L3



--
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-10  6:33           ` Oscar Salvador
@ 2020-11-10  7:10             ` Muchun Song
  0 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-10  7:10 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 2:33 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Tue, Nov 10, 2020 at 02:08:46PM +0800, Muchun Song wrote:
> > The check should be added here.
> >
> >            if (!pgtable)
> >                    return NULL;
> >
> > Just like my previous v2 patch does. In this case, we can drop those
> > checks. What do you think?
>
> It is too early for me, so bear with me.
>
> page_huge_pte will only return NULL in case we did not get to preallocate
> any pgtable right?

page_huge_pte only returns NULL once we have consumed all of the
preallocated page tables. Not every HugeTLB page needs to split the
vmemmap page tables. We preallocate page tables for each HugeTLB page;
if we end up not needing to split the PMD, we should free the
preallocated page tables.
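
In other words, roughly (a sketch only; the helpers are the ones from
patch 05, the call sites are simplified):

	/* when a fresh HugeTLB page is allocated */
	vmemmap_pgtable_prealloc(h, page);

	/* later, if we did not need to split the vmemmap PMD after all,
	 * the unused preallocated page tables go back via */
	vmemmap_pgtable_free(h, page);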

Maybe you can see the comments in the other thread.

  [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with
each hugetlb page

Thanks.

>
> What I was talking about is that
> >
> > >         page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
> > >                                                        struct page, lru);
>
> here we will get either a pgtable entry or NULL in case we already consumed
> all entries from the list.
> If that is the case, we can return NULL and let the caller know that we
> are done.
>
> Am I missing anything?


>
>
> --
> Oscar Salvador
> SUSE L3



--
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-10  6:40     ` [External] " Muchun Song
@ 2020-11-10  9:48       ` Oscar Salvador
  2020-11-10 10:47         ` Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-10  9:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 02:40:54PM +0800, Muchun Song wrote:
> Only the first HugeTLB page should split the PMD to PTE. The other 63
> HugeTLB pages
> do not need to split. Here I want to make sure we are the first.

I think the terminology is losing me here.

Say you allocate a 2MB HugeTLB page at ffffea0004100000.

The vmemmap range that represents this is ffffea0004000000 - ffffea0004200000.
That is a 2MB chunk PMD-mapped.
So, in order to free some of those vmemmap pages, we need to break down
that area, remapping it to PTE-based.
I know what you mean, but we are not really splitting hugetlb pages, but
the memmap range they are represented with.
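
(Spelling out the numbers: with 4K base pages and a 64-byte struct page,
a 2MB HugeTLB page has 512 struct pages, 512 * 64 bytes = 32KB = 8
vmemmap pages, so a single PMD-mapped 2MB vmemmap chunk like the one
above holds the struct pages of 64 such HugeTLB pages.)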

About:

"Only the first HugeTLB page should split the PMD to PTE. The other 63
HugeTLB pages
do not need to split. Here I want to make sure we are the first."

That only refers to gigantic pages, right?

> > > +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > > +{
> > > +     pmd_t *pmd;
> > > +     spinlock_t *ptl;
> > > +     LIST_HEAD(free_pages);
> > > +
> > > +     if (!free_vmemmap_pages_per_hpage(h))
> > > +             return;
> > > +
> > > +     pmd = vmemmap_to_pmd(head);
> > > +     ptl = vmemmap_pmd_lock(pmd);
> > > +     if (vmemmap_pmd_huge(pmd)) {
> > > +             VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));
> >
> > I think that checking for free_vmemmap_pages_per_hpage is enough.
> > In the end, pgtable_pages_to_prealloc_per_hpage uses free_vmemmap_pages_per_hpage.
> 
> The free_vmemmap_pages_per_hpage is not enough. See the comments above.

My comment was about the VM_BUG_ON.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-10  9:48       ` Oscar Salvador
@ 2020-11-10 10:47         ` Muchun Song
  2020-11-10 13:52           ` Oscar Salvador
  0 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-10 10:47 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 5:48 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Tue, Nov 10, 2020 at 02:40:54PM +0800, Muchun Song wrote:
> > Only the first HugeTLB page should split the PMD to PTE. The other 63
> > HugeTLB pages
> > do not need to split. Here I want to make sure we are the first.
>
> I think the terminology is losing me here.
>
> Say you allocate a 2MB HugeTLB page at ffffea0004100000.
>
> The vmemmap range that represents this is ffffea0004000000 - ffffea0004200000.
> That is a 2MB chunk PMD-mapped.
> So, in order to free some of those vmemmap pages, we need to break down
> that area, remapping it to PTE-based.
> I know what you mean, but we are not really splitting hugetlb pages, but
> the memmap range they are represented with.

Yeah, you are right. We are splitting the vmemmap instead of hugetlb.
Sorry for the confusion.

>
> About:
>
> "Only the first HugeTLB page should split the PMD to PTE. The other 63
> HugeTLB pages
> do not need to split. Here I want to make sure we are the first."
>
> That only refers to gigantic pages, right?

Yeah, now it only refers to gigantic pages. Originally, I also wanted to merge
vmemmap PTE to PMD for normal 2MB HugeTLB pages. So I introduced
those macros(e.g. freed_vmemmap_hpage). For 2MB HugeTLB pages, I
haven't found an elegant solution. Hopefully, when you or someone have
read all of the patch series, we can come up with an elegant solution to
merge PTE.

Thanks.

>
> > > > +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > > > +{
> > > > +     pmd_t *pmd;
> > > > +     spinlock_t *ptl;
> > > > +     LIST_HEAD(free_pages);
> > > > +
> > > > +     if (!free_vmemmap_pages_per_hpage(h))
> > > > +             return;
> > > > +
> > > > +     pmd = vmemmap_to_pmd(head);
> > > > +     ptl = vmemmap_pmd_lock(pmd);
> > > > +     if (vmemmap_pmd_huge(pmd)) {
> > > > +             VM_BUG_ON(!pgtable_pages_to_prealloc_per_hpage(h));
> > >
> > > I think that checking for free_vmemmap_pages_per_hpage is enough.
> > > In the end, pgtable_pages_to_prealloc_per_hpage uses free_vmemmap_pages_per_hpage.
> >
> > The free_vmemmap_pages_per_hpage is not enough. See the comments above.
>
> My comment was about the VM_BUG_ON.

Sorry, yeah, we can drop it. Thanks.

>
>
> --
> Oscar Salvador
> SUSE L3



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-10 10:47         ` Muchun Song
@ 2020-11-10 13:52           ` Oscar Salvador
  2020-11-10 14:00             ` Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Oscar Salvador @ 2020-11-10 13:52 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 06:47:08PM +0800, Muchun Song wrote:
> > That only refers to gigantic pages, right?
> 
> Yeah, now it only refers to gigantic pages. Originally, I also wanted to merge
> vmemmap PTE to PMD for normal 2MB HugeTLB pages. So I introduced
> those macros(e.g. freed_vmemmap_hpage). For 2MB HugeTLB pages, I
> haven't found an elegant solution. Hopefully, when you or someone have
> read all of the patch series, we can come up with an elegant solution to
> merge PTE.

Well, it is quite a lot of "tricky" code, so it takes some time.

> > > > > +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > > > > +{
> > > > > +     pmd_t *pmd;
> > > > > +     spinlock_t *ptl;
> > > > > +     LIST_HEAD(free_pages);
> > > > > +
> > > > > +     if (!free_vmemmap_pages_per_hpage(h))
> > > > > +             return;
> > > > > +
> > > > > +     pmd = vmemmap_to_pmd(head);
> > > > > +     ptl = vmemmap_pmd_lock(pmd);

I forgot about this one.
You might want to check whether vmemmap_to_pmd returns NULL or not.
If it does, it means that something went wrong anyway, but we should still
handle such a case (and print a fat warning or something like that).


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-10 13:52           ` Oscar Salvador
@ 2020-11-10 14:00             ` Muchun Song
  0 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-10 14:00 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Tue, Nov 10, 2020 at 9:52 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Tue, Nov 10, 2020 at 06:47:08PM +0800, Muchun Song wrote:
> > > That only refers to gigantic pages, right?
> >
> > Yeah, now it only refers to gigantic pages. Originally, I also wanted to merge
> > vmemmap PTE to PMD for normal 2MB HugeTLB pages. So I introduced
> > those macros(e.g. freed_vmemmap_hpage). For 2MB HugeTLB pages, I
> > haven't found an elegant solution. Hopefully, when you or someone have
> > read all of the patch series, we can come up with an elegant solution to
> > merge PTE.
>
> Well, it is quite a lot of "tricky" code, so it takes some time.
>
> > > > > > +static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > > > > > +{
> > > > > > +     pmd_t *pmd;
> > > > > > +     spinlock_t *ptl;
> > > > > > +     LIST_HEAD(free_pages);
> > > > > > +
> > > > > > +     if (!free_vmemmap_pages_per_hpage(h))
> > > > > > +             return;
> > > > > > +
> > > > > > +     pmd = vmemmap_to_pmd(head);
> > > > > > +     ptl = vmemmap_pmd_lock(pmd);
>
> I forgot about this one.
> You might want to check whether vmemmap_to_pmd returns NULL or not.
> If it does, it means that something went wrong anyway, but we should still
> handle such a case (and print a fat warning or something like that).

Yeah, maybe add a BUG_ON or WARN_ON. Which one do you think
would be more suitable?


>
>
> --
> Oscar Salvador
> SUSE L3



--
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 00/21] Free some vmemmap pages of hugetlb page
  2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (20 preceding siblings ...)
  2020-11-08 14:11 ` [PATCH v3 21/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page Muchun Song
@ 2020-11-10 19:23 ` Mike Kravetz
  2020-11-11  3:21   ` [External] " Muchun Song
  21 siblings, 1 reply; 54+ messages in thread
From: Mike Kravetz @ 2020-11-10 19:23 UTC (permalink / raw)
  To: Muchun Song, corbet, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel


Thanks for continuing to work on this, Muchun!

On 11/8/20 6:10 AM, Muchun Song wrote:
...
> For tail pages, the value of compound_head is the same. So we can reuse
> first page of tail page structs. We map the virtual addresses of the
> remaining 6 pages of tail page structs to the first tail page struct,
> and then free these 6 pages. Therefore, we need to reserve at least 2
> pages as vmemmap areas.
> 
> When a hugetlbpage is freed to the buddy system, we should allocate six
> pages for vmemmap pages and restore the previous mapping relationship.
> 
> If we uses the 1G hugetlbpage, we can save 4095 pages. This is a very
> substantial gain.

Is that 4095 number accurate?  Are we not using two pages of struct pages
as in the 2MB case?

Also, because we are splitting the huge page mappings in the vmemmap,
additional PTE pages will need to be allocated.  Therefore, some additional
page table pages may need to be allocated so that we can free the pages
of struct pages.  The net savings may be less than what is stated above.

Perhaps this should mention that allocation of additional page table pages
may be required?

...
> Because there are vmemmap page tables reconstruction on the freeing/allocating
> path, it increases some overhead. Here are some overhead analysis.
> 
> 1) Allocating 10240 2MB hugetlb pages.
> 
>    a) With this patch series applied:
>    # time echo 10240 > /proc/sys/vm/nr_hugepages
> 
>    real     0m0.166s
>    user     0m0.000s
>    sys      0m0.166s
> 
>    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
> 
>    @latency:
>    [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>    [16K, 32K)          1868 |@@@@@@@@@@@                                         |
>    [32K, 64K)            10 |                                                    |
>    [64K, 128K)            2 |                                                    |
> 
>    b) Without this patch series:
>    # time echo 10240 > /proc/sys/vm/nr_hugepages
> 
>    real     0m0.066s
>    user     0m0.000s
>    sys      0m0.066s
> 
>    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
> 
>    @latency:
>    [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>    [8K, 16K)             62 |                                                    |
>    [16K, 32K)             2 |                                                    |
> 
>    Summarize: this feature is about ~2x slower than before.
> 
> 2) Freeing 10240 2MB hugetlb pages.
> 
>    a) With this patch series applied:
>    # time echo 0 > /proc/sys/vm/nr_hugepages
> 
>    real     0m0.004s
>    user     0m0.000s
>    sys      0m0.002s
> 
>    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
> 
>    @latency:
>    [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
>    b) Without this patch series:
>    # time echo 0 > /proc/sys/vm/nr_hugepages
> 
>    real     0m0.077s
>    user     0m0.001s
>    sys      0m0.075s
> 
>    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
> 
>    @latency:
>    [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>    [8K, 16K)            287 |@                                                   |
>    [16K, 32K)             3 |                                                    |
> 
>    Summarize: The overhead of __free_hugepage is about ~2-4x slower than before.
>               But according to the allocation test above, I think that it is
>               also ~2x slower than before.
>
>               But why is the 'real' time of the patched kernel smaller than
>               before? Because in this patch series, freeing hugetlb pages is
>               asynchronous (through a kworker).
> 
> Although the overhead has increased, it is not on the allocating/freeing of
> each hugetlb page; it is only incurred once, when we reserve some hugetlb
> pages through /proc/sys/vm/nr_hugepages. Once the reservation is successful,
> the subsequent allocating, freeing and using are the same as before (not
> patched). So I think that the overhead is acceptable.

Thank you for benchmarking.  There are still some instances where huge pages
are allocated 'on the fly' instead of being pulled from the pool.  Michal
pointed out the case of page migration.  It is also possible for someone to
use hugetlbfs without pre-allocating huge pages to the pool.  I remember the
use case pointed out in commit 099730d67417.  It says, "I have a hugetlbfs
user which is never explicitly allocating huge pages with 'nr_hugepages'.
They only set 'nr_overcommit_hugepages' and then let the pages be allocated
from the buddy allocator at fault time."  In this case, I suspect they were
using 'page fault' allocation for initialization much like someone using
/proc/sys/vm/nr_hugepages.  So, the overhead may not be as noticeable.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-09 13:52   ` Oscar Salvador
  2020-11-09 14:20     ` [External] " Muchun Song
@ 2020-11-10 19:31     ` Mike Kravetz
  2020-11-10 19:50       ` Matthew Wilcox
  2020-11-11  3:28       ` Muchun Song
  1 sibling, 2 replies; 54+ messages in thread
From: Mike Kravetz @ 2020-11-10 19:31 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song
  Cc: corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz,
	viro, akpm, paulmck, mchehab+huawei, pawan.kumar.gupta, rdunlap,
	oneukum, anshuman.khandual, jroedel, almasrymina, rientjes,
	willy, mhocko, duanxiongchun, linux-doc, linux-kernel, linux-mm,
	linux-fsdevel

On 11/9/20 5:52 AM, Oscar Salvador wrote:
> On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
>> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
>> whether to enable the feature of freeing unused vmemmap associated
>> with HugeTLB pages. Currently only x86 is supported.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>>  arch/x86/mm/init_64.c |  2 +-
>>  fs/Kconfig            | 16 ++++++++++++++++
>>  mm/bootmem_info.c     |  3 +--
>>  3 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 0a45f062826e..0435bee2e172 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>>  
>>  static void __init register_page_bootmem_info(void)
>>  {
>> -#ifdef CONFIG_NUMA
>> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>>  	int i;
>>  
>>  	for_each_online_node(i)
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 976e8b9033c4..21b8d39a9715 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -245,6 +245,22 @@ config HUGETLBFS
>>  config HUGETLB_PAGE
>>  	def_bool HUGETLBFS
>>  
>> +config HUGETLB_PAGE_FREE_VMEMMAP
>> +	bool "Free unused vmemmap associated with HugeTLB pages"
>> +	default y
>> +	depends on X86
>> +	depends on HUGETLB_PAGE
>> +	depends on SPARSEMEM_VMEMMAP
>> +	depends on HAVE_BOOTMEM_INFO_NODE
>> +	help
>> +	  There are many struct page structures associated with each HugeTLB
>> +	  page. But we only use a few struct page structures. In this case,
>> +	  it wastes some memory. It is better to free the unused struct page
>> +	  structures to buddy system which can save some memory. For
>> +	  architectures that support it, say Y here.
>> +
>> +	  If unsure, say N.
> 
> I am not sure the above is useful for someone who needs to decide
> whether he needs/wants to enable this or not.
> I think the above fits better in a Documentation part.
> 
> I suck at this, but what about the following, or something along those
> lines? 
> 
> "
> When using SPARSEMEM_VMEMMAP, the system can save up some memory
> from pre-allocated HugeTLB pages when they are not used.
> 6 pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page.
> When the pages are going to be used or freed up, the vmemmap
> array representing that range needs to be remapped again and
> the pages we discarded earlier need to be reallocated again.
> Therefore, this is a trade-off between saving memory and
> increasing time in allocation/free path.
> "
> 
> It would be also great to point out that this might be a
> trade-off between saving up memory and increasing the cost
> of certain operations on allocation/free path.
> That is why I mentioned it there.

Yes, this is somewhat a trade-off.

As a config option, this is something that would likely be decided by
distros.  I almost hate to suggest this, but is it something that an
end user would want to decide?  Is this something that perhaps should
be a boot/kernel command line option?

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  2020-11-10  2:42     ` [External] " Muchun Song
@ 2020-11-10 19:38       ` Mike Kravetz
  2020-11-11  3:22         ` Muchun Song
  0 siblings, 1 reply; 54+ messages in thread
From: Mike Kravetz @ 2020-11-10 19:38 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador
  Cc: Jonathan Corbet, Thomas Gleixner, mingo, bp, x86, hpa,
	dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton, paulmck,
	mchehab+huawei, pawan.kumar.gupta, Randy Dunlap, oneukum,
	anshuman.khandual, jroedel, Mina Almasry, David Rientjes,
	Matthew Wilcox, Michal Hocko, Xiongchun duan, linux-doc, LKML,
	Linux Memory Management List, linux-fsdevel

On 11/9/20 6:42 PM, Muchun Song wrote:
> On Tue, Nov 10, 2020 at 12:48 AM Oscar Salvador <osalvador@suse.de> wrote:
>>
>> On Sun, Nov 08, 2020 at 10:10:56PM +0800, Muchun Song wrote:
>>
>> Unrelated to this patch but related in general, I am not sure about Mike but
>> would it be cleaner to move all the vmemmap functions to hugetlb_vmemmap.c?
>> hugetlb code is quite tricky, so I am not sure about stuffing more code
>> in there.
>>
> 
> I also think that you are right, moving all the vmemmap functions to
> hugetlb_vmemmap.c may make the code cleaner.
> 
> Hi Mike, what's your opinion?

I would be happy to see this in a separate file.  As Oscar mentions, the
hugetlb.c file/code is already somewhat difficult to read and understand.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-10 19:31     ` Mike Kravetz
@ 2020-11-10 19:50       ` Matthew Wilcox
  2020-11-10 20:30         ` Mike Kravetz
  2020-11-17 15:35         ` [External] " Muchun Song
  2020-11-11  3:28       ` Muchun Song
  1 sibling, 2 replies; 54+ messages in thread
From: Matthew Wilcox @ 2020-11-10 19:50 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, Muchun Song, corbet, tglx, mingo, bp, x86, hpa,
	dave.hansen, luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On Tue, Nov 10, 2020 at 11:31:31AM -0800, Mike Kravetz wrote:
> On 11/9/20 5:52 AM, Oscar Salvador wrote:
> > On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
> >> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> >> whether to enable the feature of freeing unused vmemmap associated
> >> with HugeTLB pages. Currently only x86 is supported.
> >>
> >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >> ---
> >>  arch/x86/mm/init_64.c |  2 +-
> >>  fs/Kconfig            | 16 ++++++++++++++++
> >>  mm/bootmem_info.c     |  3 +--
> >>  3 files changed, 18 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 0a45f062826e..0435bee2e172 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> >> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> >>  
> >>  static void __init register_page_bootmem_info(void)
> >>  {
> >> -#ifdef CONFIG_NUMA
> >> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> >>  	int i;
> >>  
> >>  	for_each_online_node(i)
> >> diff --git a/fs/Kconfig b/fs/Kconfig
> >> index 976e8b9033c4..21b8d39a9715 100644
> >> --- a/fs/Kconfig
> >> +++ b/fs/Kconfig
> >> @@ -245,6 +245,22 @@ config HUGETLBFS
> >>  config HUGETLB_PAGE
> >>  	def_bool HUGETLBFS
> >>  
> >> +config HUGETLB_PAGE_FREE_VMEMMAP
> >> +	bool "Free unused vmemmap associated with HugeTLB pages"
> >> +	default y
> >> +	depends on X86
> >> +	depends on HUGETLB_PAGE
> >> +	depends on SPARSEMEM_VMEMMAP
> >> +	depends on HAVE_BOOTMEM_INFO_NODE
> >> +	help
> >> +	  There are many struct page structures associated with each HugeTLB
> >> +	  page. But we only use a few struct page structures. In this case,
> >> +	  it wastes some memory. It is better to free the unused struct page
> >> +	  structures to buddy system which can save some memory. For
> >> +	  architectures that support it, say Y here.
> >> +
> >> +	  If unsure, say N.
> > 
> > I am not sure the above is useful for someone who needs to decide
> > whether he needs/wants to enable this or not.
> > I think the above fits better in a Documentation part.
> > 
> > I suck at this, but what about the following, or something along those
> > lines? 
> > 
> > "
> > When using SPARSEMEM_VMEMMAP, the system can save up some memory
> > from pre-allocated HugeTLB pages when they are not used.
> > 6 pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page.
> > When the pages are going to be used or freed up, the vmemmap
> > array representing that range needs to be remapped again and
> > > the pages we discarded earlier need to be reallocated again.
> > Therefore, this is a trade-off between saving memory and
> > increasing time in allocation/free path.
> > "
> > 
> > It would be also great to point out that this might be a
> > trade-off between saving up memory and increasing the cost
> > of certain operations on allocation/free path.
> > That is why I mentioned it there.
> 
> Yes, this is somewhat a trade-off.
> 
> As a config option, this is something that would likely be decided by
> distros.  I almost hate to suggest this, but is it something that an
> end user would want to decide?  Is this something that perhaps should
> be a boot/kernel command line option?

I don't like config options.  I like boot options even less.  I don't
know how to describe to an end-user whether they should select this
or not.  Is there a way to make this not a tradeoff?  Or make the
tradeoff so minimal as to be not worth describing?  (do we have numbers
for the worst possible situation when enabling this option?)

I haven't read through these patches in detail, so maybe we do this
already, but when we free the pages to the buddy allocator, do we retain
the third page to use for the PTEs (and free pages 3-7), or do we allocate
a separate page for the PTEs and free pages 2-7?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-10 19:50       ` Matthew Wilcox
@ 2020-11-10 20:30         ` Mike Kravetz
  2020-11-17 15:35         ` [External] " Muchun Song
  1 sibling, 0 replies; 54+ messages in thread
From: Mike Kravetz @ 2020-11-10 20:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Oscar Salvador, Muchun Song, corbet, tglx, mingo, bp, x86, hpa,
	dave.hansen, luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, mhocko, duanxiongchun, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel

On 11/10/20 11:50 AM, Matthew Wilcox wrote:
> On Tue, Nov 10, 2020 at 11:31:31AM -0800, Mike Kravetz wrote:
>> On 11/9/20 5:52 AM, Oscar Salvador wrote:
>>> On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
> 
> I don't like config options.  I like boot options even less.  I don't
> know how to describe to an end-user whether they should select this
> or not.  Is there a way to make this not a tradeoff?  Or make the
> tradeoff so minimal as to be not worth describing?  (do we have numbers
> for the worst possible situation when enabling this option?)

It is not exactly worst case, but Muchun provided some simple benchmarking
results in the cover letter.  Quick summary is that hugetlb page creation
and free time is "~2x slower".  At first glance, one would say that is
terrible.  However, remember that the majority of use cases create hugetlb
pages at or shortly after boot time and add them to the pool.  So, additional
overhead is at pool creation time.  There is no change to 'normal run time'
operations of getting a page from or returning a page to the pool (think
page fault/unmap).

> I haven't read through these patches in detail, so maybe we do this
> already, but when we free the pages to the buddy allocator, do we retain
> the third page to use for the PTEs (and free pages 3-7), or do we allocate
> a separate page for the PTEs and free pages 2-7?

I haven't got there in this latest series.  But, in previous revisions the
code did allocate a separate page.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-08 14:10 ` [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
  2020-11-09 17:21   ` Oscar Salvador
@ 2020-11-11  0:47   ` Mike Kravetz
  2020-11-11  3:41     ` [External] " Muchun Song
  1 sibling, 1 reply; 54+ messages in thread
From: Mike Kravetz @ 2020-11-11  0:47 UTC (permalink / raw)
  To: Muchun Song, corbet, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 11/8/20 6:10 AM, Muchun Song wrote:
> On x86_64, vmemmap is always PMD mapped if the machine has hugepage
> support and we have 2MB contiguous, PMD-aligned pages. If we want
> to free the unused vmemmap pages, we have to split the huge PMD first.
> So we should pre-allocate pgtables to split the PMD into PTEs.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a0007902fafb..5c7be2ee7e15 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1303,6 +1303,108 @@ static inline void destroy_compound_gigantic_page(struct page *page,
...
> +static inline void vmemmap_pgtable_init(struct page *page)
> +{
> +	page_huge_pte(page) = NULL;
> +}
> +
> +static void vmemmap_pgtable_deposit(struct page *page, pgtable_t pgtable)
> +{
> +	/* FIFO */
> +	if (!page_huge_pte(page))
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &page_huge_pte(page)->lru);
> +	page_huge_pte(page) = pgtable;
> +}
> +
> +static pgtable_t vmemmap_pgtable_withdraw(struct page *page)
> +{
> +	pgtable_t pgtable;
> +
> +	/* FIFO */
> +	pgtable = page_huge_pte(page);
> +	page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
> +						       struct page, lru);
> +	if (page_huge_pte(page))
> +		list_del(&pgtable->lru);
> +	return pgtable;
> +}
> +
> +static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> +{
> +	int i;
> +	pgtable_t pgtable;
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	if (!nr)
> +		return 0;
> +
> +	vmemmap_pgtable_init(page);
> +
> +	for (i = 0; i < nr; i++) {
> +		pte_t *pte_p;
> +
> +		pte_p = pte_alloc_one_kernel(&init_mm);
> +		if (!pte_p)
> +			goto out;
> +		vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
> +	}
> +
> +	return 0;
> +out:
> +	while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> +		pte_free_kernel(&init_mm, page_to_virt(pgtable));
> +	return -ENOMEM;
> +}
> +
> +static void vmemmap_pgtable_free(struct hstate *h, struct page *page)
> +{
> +	pgtable_t pgtable;
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	if (!nr)
> +		return;
> +
> +	pgtable = page_huge_pte(page);
> +	if (!pgtable)
> +		return;
> +
> +	while (nr-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> +		pte_free_kernel(&init_mm, page_to_virt(pgtable));
> +}

I may be confused.

In patch 9 of this series, the first call to vmemmap_pgtable_free() is made:

> @@ -1645,6 +1799,10 @@ void free_huge_page(struct page *page)
>  
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  {
> +	free_huge_page_vmemmap(h, page);
> +	/* Must be called before the initialization of @page->lru */
> +	vmemmap_pgtable_free(h, page);
> +
>  	INIT_LIST_HEAD(&page->lru);
>  	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>  	set_hugetlb_cgroup(page, NULL);

When I saw that comment in previous patch series, I assumed page->lru was
being used to store preallocated pages and pages to free.  However, unless
I am reading the code incorrectly it does not appear page->lru (of the huge
page) is being used for this purpose.  Is that correct?

If it is correct, would using page->lru of the huge page make this code
simpler?  I am just missing the reason why you are using
page_huge_pte(page)->lru

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 00/21] Free some vmemmap pages of hugetlb page
  2020-11-10 19:23 ` [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Mike Kravetz
@ 2020-11-11  3:21   ` Muchun Song
  0 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-11  3:21 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Jonathan Corbet, Thomas Gleixner, mingo, bp, x86, hpa,
	dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton, paulmck,
	mchehab+huawei, pawan.kumar.gupta, Randy Dunlap, oneukum,
	anshuman.khandual, jroedel, Mina Almasry, David Rientjes,
	Matthew Wilcox, Oscar Salvador, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Wed, Nov 11, 2020 at 3:23 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
>
> Thanks for continuing to work on this, Muchun!
>
> On 11/8/20 6:10 AM, Muchun Song wrote:
> ...
> > For tail pages, the value of compound_head is the same. So we can reuse
> > the first page of tail page structs. We map the virtual addresses of the
> > remaining 6 pages of tail page structs to the first tail page struct,
> > and then free these 6 pages. Therefore, we need to reserve at least 2
> > pages as vmemmap areas.
> >
> > When a hugetlbpage is freed to the buddy system, we should allocate six
> > pages for vmemmap pages and restore the previous mapping relationship.
> >
> > If we use the 1GB hugetlb page, we can save 4095 pages. This is a very
> > substantial gain.
>
> Is that 4095 number accurate?  Are we not using two pages of struct pages
> as in the 2MB case?

Oh, yeah, it should be 4094, and we also need to subtract the page table
pages. For a 1GB HugeTLB page, the net saving should be 4086 pages. Thanks
for pointing out this problem.
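
(Rough arithmetic, assuming 4K base pages and a 64-byte struct page: a 1GB
HugeTLB page has 262144 struct pages, i.e. 262144 * 64 / 4096 = 4096 vmemmap
pages. Keeping 2 of them leaves 4094, and splitting the 8 PMD-mapped vmemmap
ranges costs 8 extra PTE pages, so the net saving is about 4086 pages.)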

>
> Also, because we are splitting the huge page mappings in the vmemmap
> additional PTE pages will need to be allocated.  Therefore, some additional
> page table pages may need to be allocated so that we can free the pages
> of struct pages.  The net savings may be less than what is stated above.
>
> Perhaps this should mention that allocation of additional page table pages
> may be required?

Yeah, you are right. In the next version, I will rework the analysis
here to make it clearer and more accurate.

>
> ...
> > Because the vmemmap page tables are reconstructed on the freeing/allocating
> > path, it adds some overhead. Here is some overhead analysis.
> >
> > 1) Allocating 10240 2MB hugetlb pages.
> >
> >    a) With this patch series applied:
> >    # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.166s
> >    user     0m0.000s
> >    sys      0m0.166s
> >
> >    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >    [16K, 32K)          1868 |@@@@@@@@@@@                                         |
> >    [32K, 64K)            10 |                                                    |
> >    [64K, 128K)            2 |                                                    |
> >
> >    b) Without this patch series:
> >    # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.066s
> >    user     0m0.000s
> >    sys      0m0.066s
> >
> >    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >    [8K, 16K)             62 |                                                    |
> >    [16K, 32K)             2 |                                                    |
> >
> >    Summarize: this feature is about ~2x slower than before.
> >
> > 2) Freeing 10240 2MB hugetlb pages.
> >
> >    a) With this patch series applied:
> >    # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.004s
> >    user     0m0.000s
> >    sys      0m0.002s
> >
> >    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> >    b) Without this patch series:
> >    # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.077s
> >    user     0m0.001s
> >    sys      0m0.075s
> >
> >    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >    [8K, 16K)            287 |@                                                   |
> >    [16K, 32K)             3 |                                                    |
> >
> >    Summarize: The overhead of __free_hugepage is about ~2-4x slower than before.
> >               But according to the allocation test above, I think that it is
> >               also ~2x slower than before.
> >
> >               But why is the 'real' time of the patched kernel smaller than
> >               before? Because in this patch series, freeing hugetlb pages is
> >               asynchronous (through a kworker).
> >
> > Although the overhead has increased, it is not on the allocating/freeing of
> > each hugetlb page; it is only incurred once, when we reserve some hugetlb
> > pages through /proc/sys/vm/nr_hugepages. Once the reservation is successful,
> > the subsequent allocating, freeing and using are the same as before (not
> > patched). So I think that the overhead is acceptable.
>
> Thank you for benchmarking.  There are still some instances where huge pages
> are allocated 'on the fly' instead of being pulled from the pool.  Michal
> pointed out the case of page migration.  It is also possible for someone to
> use hugetlbfs without pre-allocating huge pages to the pool.  I remember the
> use case pointed out in commit 099730d67417.  It says, "I have a hugetlbfs
> user which is never explicitly allocating huge pages with 'nr_hugepages'.
> They only set 'nr_overcommit_hugepages' and then let the pages be allocated
> from the buddy allocator at fault time."  In this case, I suspect they were
> using 'page fault' allocation for initialization much like someone using
> /proc/sys/vm/nr_hugepages.  So, the overhead may not be as noticeable.

Thanks for pointing out this use case.

>
> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  2020-11-10 19:38       ` Mike Kravetz
@ 2020-11-11  3:22         ` Muchun Song
  0 siblings, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-11  3:22 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, Jonathan Corbet, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Wed, Nov 11, 2020 at 3:40 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/9/20 6:42 PM, Muchun Song wrote:
> > On Tue, Nov 10, 2020 at 12:48 AM Oscar Salvador <osalvador@suse.de> wrote:
> >>
> >> On Sun, Nov 08, 2020 at 10:10:56PM +0800, Muchun Song wrote:
> >>
> >> Unrelated to this patch but related in general, I am not sure about Mike but
> >> would it be cleaner to move all the vmemmap functions to hugetlb_vmemmap.c?
> >> hugetlb code is quite tricky, so I am not sure about stuffing more code
> >> in there.
> >>
> >
> > I also think that you are right, moving all the vmemmap functions to
> > hugetlb_vmemmap.c may make the code cleaner.
> >
> > Hi Mike, what's your opinion?
>
> I would be happy to see this in a separate file.  As Oscar mentions, the
> hugetlb.c file/code is already somewhat difficult to read and understand.

Got it. I will do this. Thanks.

> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-10 19:31     ` Mike Kravetz
  2020-11-10 19:50       ` Matthew Wilcox
@ 2020-11-11  3:28       ` Muchun Song
  1 sibling, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-11  3:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, Jonathan Corbet, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Wed, Nov 11, 2020 at 3:31 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/9/20 5:52 AM, Oscar Salvador wrote:
> > On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
> >> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> >> whether to enable the feature of freeing unused vmemmap associated
> >> with HugeTLB pages. Currently only x86 is supported.
> >>
> >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >> ---
> >>  arch/x86/mm/init_64.c |  2 +-
> >>  fs/Kconfig            | 16 ++++++++++++++++
> >>  mm/bootmem_info.c     |  3 +--
> >>  3 files changed, 18 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 0a45f062826e..0435bee2e172 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> >> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> >>
> >>  static void __init register_page_bootmem_info(void)
> >>  {
> >> -#ifdef CONFIG_NUMA
> >> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> >>      int i;
> >>
> >>      for_each_online_node(i)
> >> diff --git a/fs/Kconfig b/fs/Kconfig
> >> index 976e8b9033c4..21b8d39a9715 100644
> >> --- a/fs/Kconfig
> >> +++ b/fs/Kconfig
> >> @@ -245,6 +245,22 @@ config HUGETLBFS
> >>  config HUGETLB_PAGE
> >>      def_bool HUGETLBFS
> >>
> >> +config HUGETLB_PAGE_FREE_VMEMMAP
> >> +    bool "Free unused vmemmap associated with HugeTLB pages"
> >> +    default y
> >> +    depends on X86
> >> +    depends on HUGETLB_PAGE
> >> +    depends on SPARSEMEM_VMEMMAP
> >> +    depends on HAVE_BOOTMEM_INFO_NODE
> >> +    help
> >> +      There are many struct page structures associated with each HugeTLB
> >> +      page. But we only use a few struct page structures. In this case,
> >> +      it wastes some memory. It is better to free the unused struct page
> >> +      structures to buddy system which can save some memory. For
> >> +      architectures that support it, say Y here.
> >> +
> >> +      If unsure, say N.
> >
> > I am not sure the above is useful for someone who needs to decide
> > whether he needs/wants to enable this or not.
> > I think the above fits better in a Documentation part.
> >
> > I suck at this, but what about the following, or something along those
> > lines?
> >
> > "
> > When using SPARSEMEM_VMEMMAP, the system can save up some memory
> > from pre-allocated HugeTLB pages when they are not used.
> > 6 pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page.
> > When the pages are going to be used or freed up, the vmemmap
> > array representing that range needs to be remapped again and
> > > the pages we discarded earlier need to be reallocated again.
> > Therefore, this is a trade-off between saving memory and
> > increasing time in allocation/free path.
> > "
> >
> > It would be also great to point out that this might be a
> > trade-off between saving up memory and increasing the cost
> > of certain operations on allocation/free path.
> > That is why I mentioned it there.
>
> Yes, this is somewhat a trade-off.
>
> As a config option, this is something that would likely be decided by
> distros.  I almost hate to suggest this, but is it something that an
> end user would want to decide?  Is this something that perhaps should
> be a boot/kernel command line option?

Yeah, it already has a boot/kernel command line option named
"hugetlb_free_vmemmap". We can refer to

  [PATCH v3 18/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
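
Just as an illustration, a minimal sketch of how such an early_param hook
can be wired up (the actual code in patch 18/21 may differ; the variable
and function names below are made up):

	bool hugetlb_free_vmemmap_enabled;

	static int __init early_hugetlb_free_vmemmap_param(char *buf)
	{
		hugetlb_free_vmemmap_enabled = true;
		return 0;
	}
	early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);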

Thanks :)


>
> --
> Mike Kravetz



--
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-11  0:47   ` Mike Kravetz
@ 2020-11-11  3:41     ` Muchun Song
  2020-11-13  0:35       ` Mike Kravetz
  0 siblings, 1 reply; 54+ messages in thread
From: Muchun Song @ 2020-11-11  3:41 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Jonathan Corbet, Thomas Gleixner, mingo, bp, x86, hpa,
	dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton, paulmck,
	mchehab+huawei, pawan.kumar.gupta, Randy Dunlap, oneukum,
	anshuman.khandual, jroedel, Mina Almasry, David Rientjes,
	Matthew Wilcox, Oscar Salvador, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Wed, Nov 11, 2020 at 8:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/8/20 6:10 AM, Muchun Song wrote:
> > On x86_64, vmemmap is always PMD mapped if the machine has hugepage
> > support and we have 2MB contiguous, PMD-aligned pages. If we want
> > to free the unused vmemmap pages, we have to split the huge PMD first.
> > So we should pre-allocate pgtables to split the PMD into PTEs.
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a0007902fafb..5c7be2ee7e15 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1303,6 +1303,108 @@ static inline void destroy_compound_gigantic_page(struct page *page,
> ...
> > +static inline void vmemmap_pgtable_init(struct page *page)
> > +{
> > +     page_huge_pte(page) = NULL;
> > +}
> > +
> > +static void vmemmap_pgtable_deposit(struct page *page, pgtable_t pgtable)
> > +{
> > +     /* FIFO */
> > +     if (!page_huge_pte(page))
> > +             INIT_LIST_HEAD(&pgtable->lru);
> > +     else
> > +             list_add(&pgtable->lru, &page_huge_pte(page)->lru);
> > +     page_huge_pte(page) = pgtable;
> > +}
> > +
> > +static pgtable_t vmemmap_pgtable_withdraw(struct page *page)
> > +{
> > +     pgtable_t pgtable;
> > +
> > +     /* FIFO */
> > +     pgtable = page_huge_pte(page);
> > +     page_huge_pte(page) = list_first_entry_or_null(&pgtable->lru,
> > +                                                    struct page, lru);
> > +     if (page_huge_pte(page))
> > +             list_del(&pgtable->lru);
> > +     return pgtable;
> > +}
> > +
> > +static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> > +{
> > +     int i;
> > +     pgtable_t pgtable;
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     if (!nr)
> > +             return 0;
> > +
> > +     vmemmap_pgtable_init(page);
> > +
> > +     for (i = 0; i < nr; i++) {
> > +             pte_t *pte_p;
> > +
> > +             pte_p = pte_alloc_one_kernel(&init_mm);
> > +             if (!pte_p)
> > +                     goto out;
> > +             vmemmap_pgtable_deposit(page, virt_to_page(pte_p));
> > +     }
> > +
> > +     return 0;
> > +out:
> > +     while (i-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> > +             pte_free_kernel(&init_mm, page_to_virt(pgtable));
> > +     return -ENOMEM;
> > +}
> > +
> > +static void vmemmap_pgtable_free(struct hstate *h, struct page *page)
> > +{
> > +     pgtable_t pgtable;
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     if (!nr)
> > +             return;
> > +
> > +     pgtable = page_huge_pte(page);
> > +     if (!pgtable)
> > +             return;
> > +
> > +     while (nr-- && (pgtable = vmemmap_pgtable_withdraw(page)))
> > +             pte_free_kernel(&init_mm, page_to_virt(pgtable));
> > +}
>
> I may be confused.
>
> In patch 9 of this series, the first call to vmemmap_pgtable_free() is made:
>
> > @@ -1645,6 +1799,10 @@ void free_huge_page(struct page *page)
> >
> >  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> >  {
> > +     free_huge_page_vmemmap(h, page);
> > +     /* Must be called before the initialization of @page->lru */
> > +     vmemmap_pgtable_free(h, page);
> > +
> >       INIT_LIST_HEAD(&page->lru);
> >       set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> >       set_hugetlb_cgroup(page, NULL);
>
> When I saw that comment in previous patch series, I assumed page->lru was
> being used to store preallocated pages and pages to free.  However, unless

Yeah, you are right.

> I am reading the code incorrectly it does not appear page->lru (of the huge
> page) is being used for this purpose.  Is that correct?
>
> If it is correct, would using page->lru of the huge page make this code
> simpler?  I am just missing the reason why you are using
> page_huge_pte(page)->lru

For 1GB HugeTLB pages, we should pre-allocate more than one page
table, so I use a linked list. The page_huge_pte(page) is the list head,
because page->lru shares storage with page->pmd_huge_pte.

+     /* Must be called before the initialization of @page->lru */
+     vmemmap_pgtable_free(h, page);
+
       INIT_LIST_HEAD(&page->lru);

Here we initialize the lru, so vmemmap_pgtable_free() should be called
before this. It seems that I should point out this sharing
in the comment.
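
Roughly speaking, this is a simplified view of the union in struct page
(see include/linux/mm_types.h for the exact layout):

	struct page {
		unsigned long flags;
		union {
			struct {	/* Page cache and anonymous pages */
				struct list_head lru;	/* page->lru lives here */
				/* ... */
			};
			struct {	/* Page table pages */
				unsigned long _pt_pad_1;
				pgtable_t pmd_huge_pte;	/* page_huge_pte() uses this */
				/* ... */
			};
			/* ... */
		};
		/* ... */
	};

So writing page->lru clobbers page_huge_pte(page) and vice versa, which is
why the ordering above matters.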

Thanks.

>
> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-11  3:41     ` [External] " Muchun Song
@ 2020-11-13  0:35       ` Mike Kravetz
  2020-11-13  1:02         ` Mike Kravetz
  2020-11-13  4:18         ` Muchun Song
  0 siblings, 2 replies; 54+ messages in thread
From: Mike Kravetz @ 2020-11-13  0:35 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Thomas Gleixner, mingo, bp, x86, hpa,
	dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton, paulmck,
	mchehab+huawei, pawan.kumar.gupta, Randy Dunlap, oneukum,
	anshuman.khandual, jroedel, Mina Almasry, David Rientjes,
	Matthew Wilcox, Oscar Salvador, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On 11/10/20 7:41 PM, Muchun Song wrote:
> On Wed, Nov 11, 2020 at 8:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 11/8/20 6:10 AM, Muchun Song wrote:
>> I am reading the code incorrectly it does not appear page->lru (of the huge
>> page) is being used for this purpose.  Is that correct?
>>
>> If it is correct, would using page->lru of the huge page make this code
>> simpler?  I am just missing the reason why you are using
>> page_huge_pte(page)->lru
> 
> For 1GB HugeTLB pages, we should pre-allocate more than one page
> table. So I use a linked list. The page_huge_pte(page) is the list head.
> Because the page->lru shares storage with page->pmd_huge_pte.

Sorry, but I do not understand the statement page->lru shares storage with
page->pmd_huge_pte.  Are you saying they are both in head struct page of
the huge page?

Here is what I was suggesting.  If we just use page->lru for the list
then vmemmap_pgtable_prealloc() could be coded like the following:

static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
{
	struct page *pte_page, *t_page;
	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);

	if (!nr)
		return 0;

	/* Store preallocated pages on huge page lru list */
	INIT_LIST_HEAD(&page->lru);

	while (nr--) {
		pte_t *pte_p;

		pte_p = pte_alloc_one_kernel(&init_mm);
		if (!pte_p)
			goto out;
		list_add(&virt_to_page(pte_p)->lru, &page->lru);
	}

	return 0;
out:
	list_for_each_entry_safe(pte_page, t_page, &page->lru, lru)
		pte_free_kernel(&init_mm, page_to_virt(pte_page));
	return -ENOMEM;
}

By doing this we could eliminate the routines,
vmemmap_pgtable_init()
vmemmap_pgtable_deposit()
vmemmap_pgtable_withdraw()
and simply use the list manipulation routines.

To me, that looks simpler than the proposed code in this patch.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-13  0:35       ` Mike Kravetz
@ 2020-11-13  1:02         ` Mike Kravetz
  2020-11-13  4:18         ` Muchun Song
  1 sibling, 0 replies; 54+ messages in thread
From: Mike Kravetz @ 2020-11-13  1:02 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Thomas Gleixner, mingo, bp, x86, hpa,
	dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton, paulmck,
	mchehab+huawei, pawan.kumar.gupta, Randy Dunlap, oneukum,
	anshuman.khandual, jroedel, Mina Almasry, David Rientjes,
	Matthew Wilcox, Oscar Salvador, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On 11/12/20 4:35 PM, Mike Kravetz wrote:
> On 11/10/20 7:41 PM, Muchun Song wrote:
>> On Wed, Nov 11, 2020 at 8:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> On 11/8/20 6:10 AM, Muchun Song wrote:
>>> I am reading the code incorrectly it does not appear page->lru (of the huge
>>> page) is being used for this purpose.  Is that correct?
>>>
>>> If it is correct, would using page->lru of the huge page make this code
>>> simpler?  I am just missing the reason why you are using
>>> page_huge_pte(page)->lru
>>
>> For 1GB HugeTLB pages, we should pre-allocate more than one page
>> table. So I use a linked list. The page_huge_pte(page) is the list head.
>> Because the page->lru shares storage with page->pmd_huge_pte.
> 
> Sorry, but I do not understand the statement page->lru shares storage with
> page->pmd_huge_pte.  Are you saying they are both in head struct page of
> the huge page?
> 
> Here is what I was suggesting.  If we just use page->lru for the list
> then vmemmap_pgtable_prealloc() could be coded like the following:
> 
> static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> {
> 	struct page *pte_page, *t_page;
> 	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> 
> 	if (!nr)
> 		return 0;
> 
> 	/* Store preallocated pages on huge page lru list */
> 	INIT_LIST_HEAD(&page->lru);
> 
> 	while (nr--) {
> 		pte_t *pte_p;
> 
> 		pte_p = pte_alloc_one_kernel(&init_mm);
> 		if (!pte_p)
> 			goto out;
> 		list_add(&virt_to_page(pte_p)->lru, &page->lru);
> 	}
> 
> 	return 0;
> out:
> 	list_for_each_entry_safe(pte_page, t_page, &page->lru, lru)

Forgot the list_del(&pte_page->lru)
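I.e. the error path above would need to become something like (same names as
in the sketch, just with the list_del() added):

out:
	list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
		list_del(&pte_page->lru);
		pte_free_kernel(&init_mm, page_to_virt(pte_page));
	}
	return -ENOMEM;
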
Perhaps it is not simpler after all. :)
-- 
Mike Kravetz

> 		pte_free_kernel(&init_mm, page_to_virt(pte_page));
> 	return -ENOMEM;
> }
> 
> By doing this we could eliminate the routines,
> vmemmap_pgtable_init()
> vmemmap_pgtable_deposit()
> vmemmap_pgtable_withdraw()
> and simply use the list manipulation routines.
> 
> To me, that looks simpler than the proposed code in this patch.
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-13  0:35       ` Mike Kravetz
  2020-11-13  1:02         ` Mike Kravetz
@ 2020-11-13  4:18         ` Muchun Song
  1 sibling, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-13  4:18 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Jonathan Corbet, Thomas Gleixner, mingo, bp, x86, hpa,
	dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton, paulmck,
	mchehab+huawei, pawan.kumar.gupta, Randy Dunlap, oneukum,
	anshuman.khandual, jroedel, Mina Almasry, David Rientjes,
	Matthew Wilcox, Oscar Salvador, Michal Hocko, Xiongchun duan,
	linux-doc, LKML, Linux Memory Management List, linux-fsdevel

On Fri, Nov 13, 2020 at 8:38 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/10/20 7:41 PM, Muchun Song wrote:
> > On Wed, Nov 11, 2020 at 8:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> On 11/8/20 6:10 AM, Muchun Song wrote:
> >> I am reading the code incorrectly it does not appear page->lru (of the huge
> >> page) is being used for this purpose.  Is that correct?
> >>
> >> If it is correct, would using page->lru of the huge page make this code
> >> simpler?  I am just missing the reason why you are using
> >> page_huge_pte(page)->lru
> >
> > For 1GB HugeTLB pages, we should pre-allocate more than one page
> > table. So I use a linked list. The page_huge_pte(page) is the list head.
> > Because the page->lru shares storage with page->pmd_huge_pte.
>
> Sorry, but I do not understand the statement page->lru shares storage with
> page->pmd_huge_pte.  Are you saying they are both in head struct page of
> the huge page?
>
> Here is what I was suggesting.  If we just use page->lru for the list
> then vmemmap_pgtable_prealloc() could be coded like the following:
>
> static int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> {
>         struct page *pte_page, *t_page;
>         unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
>
>         if (!nr)
>                 return 0;
>
>         /* Store preallocated pages on huge page lru list */
>         INIT_LIST_HEAD(&page->lru);
>
>         while (nr--) {
>                 pte_t *pte_p;
>
>                 pte_p = pte_alloc_one_kernel(&init_mm);
>                 if (!pte_p)
>                         goto out;
>                 list_add(&virt_to_page(pte_p)->lru, &page->lru);
>         }
>
>         return 0;
> out:
>         list_for_each_entry_safe(pte_page, t_page, &page->lru, lru)
>                 pte_free_kernel(&init_mm, page_to_virt(pte_page));
>         return -ENOMEM;
> }
>
> By doing this we could eliminate the routines,
> vmemmap_pgtable_init()
> vmemmap_pgtable_deposit()
> vmemmap_pgtable_withdraw()
> and simply use the list manipulation routines.

Now I know what you mean. Yeah, just using page->lru can make the code
simpler. Thanks for your suggestions.

>
> To me, that looks simpler than the proposed code in this patch.
> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [External] Re: [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-10 19:50       ` Matthew Wilcox
  2020-11-10 20:30         ` Mike Kravetz
@ 2020-11-17 15:35         ` Muchun Song
  1 sibling, 0 replies; 54+ messages in thread
From: Muchun Song @ 2020-11-17 15:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Kravetz, Oscar Salvador, Jonathan Corbet, Thomas Gleixner,
	mingo, bp, x86, hpa, dave.hansen, luto, Peter Zijlstra, viro,
	Andrew Morton, paulmck, mchehab+huawei, pawan.kumar.gupta,
	Randy Dunlap, oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Michal Hocko, Xiongchun duan, linux-doc, LKML,
	Linux Memory Management List, linux-fsdevel

On Wed, Nov 11, 2020 at 3:50 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 10, 2020 at 11:31:31AM -0800, Mike Kravetz wrote:
> > On 11/9/20 5:52 AM, Oscar Salvador wrote:
> > > On Sun, Nov 08, 2020 at 10:10:55PM +0800, Muchun Song wrote:
> > >> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> > >> whether to enable the feature of freeing unused vmemmap associated
> > >> with HugeTLB pages. Currently only x86 is supported.
> > >>
> > >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > >> ---
> > >>  arch/x86/mm/init_64.c |  2 +-
> > >>  fs/Kconfig            | 16 ++++++++++++++++
> > >>  mm/bootmem_info.c     |  3 +--
> > >>  3 files changed, 18 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > >> index 0a45f062826e..0435bee2e172 100644
> > >> --- a/arch/x86/mm/init_64.c
> > >> +++ b/arch/x86/mm/init_64.c
> > >> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> > >>
> > >>  static void __init register_page_bootmem_info(void)
> > >>  {
> > >> -#ifdef CONFIG_NUMA
> > >> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> > >>    int i;
> > >>
> > >>    for_each_online_node(i)
> > >> diff --git a/fs/Kconfig b/fs/Kconfig
> > >> index 976e8b9033c4..21b8d39a9715 100644
> > >> --- a/fs/Kconfig
> > >> +++ b/fs/Kconfig
> > >> @@ -245,6 +245,22 @@ config HUGETLBFS
> > >>  config HUGETLB_PAGE
> > >>    def_bool HUGETLBFS
> > >>
> > >> +config HUGETLB_PAGE_FREE_VMEMMAP
> > >> +  bool "Free unused vmemmap associated with HugeTLB pages"
> > >> +  default y
> > >> +  depends on X86
> > >> +  depends on HUGETLB_PAGE
> > >> +  depends on SPARSEMEM_VMEMMAP
> > >> +  depends on HAVE_BOOTMEM_INFO_NODE
> > >> +  help
> > >> +    There are many struct page structures associated with each HugeTLB
> > >> +    page. But we only use a few struct page structures. In this case,
> > >> +    it wastes some memory. It is better to free the unused struct page
> > >> +    structures to buddy system which can save some memory. For
> > >> +    architectures that support it, say Y here.
> > >> +
> > >> +    If unsure, say N.
> > >
> > > I am not sure the above is useful for someone who needs to decide
> > > whether he needs/wants to enable this or not.
> > > I think the above fits better in a Documentation part.
> > >
> > > I suck at this, but what about the following, or something along those
> > > lines?
> > >
> > > "
> > > When using SPARSEMEM_VMEMMAP, the system can save up some memory
> > > from pre-allocated HugeTLB pages when they are not used.
> > > 6 pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page.
> > > When the pages are going to be used or freed up, the vmemmap
> > > array representing that range needs to be remapped again and
> > > the pages we discarded earlier need to be reallocated again.
> > > Therefore, this is a trade-off between saving memory and
> > > increasing time in allocation/free path.
> > > "
> > >
> > > It would be also great to point out that this might be a
> > > trade-off between saving up memory and increasing the cost
> > > of certain operations on allocation/free path.
> > > That is why I mentioned it there.
> >
> > Yes, this is somewhat a trade-off.
> >
> > As a config option, this is something that would likely be decided by
> > distros.  I almost hate to suggest this, but is it something that an
> > end user would want to decide?  Is this something that perhaps should
> > be a boot/kernel command line option?
>
> I don't like config options.  I like boot options even less.  I don't
> know how to describe to an end-user whether they should select this
> or not.  Is there a way to make this not a tradeoff?  Or make the
> tradeoff so minimal as to be not worth describing?  (do we have numbers
> for the worst possible situation when enabling this option?)
>
> I haven't read through these patches in detail, so maybe we do this
> already, but when we free the pages to the buddy allocator, do we retain
> the third page to use for the PTEs (and free pages 3-7), or do we allocate
> a separate page for the PTEs and free pages 2-7?

Sorry for missing this reply. It is a good idea. I will investigate
and implement this. Thanks, Matthew.



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2020-11-17 15:36 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-08 14:10 [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Muchun Song
2020-11-08 14:10 ` [PATCH v3 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
2020-11-08 14:10 ` [PATCH v3 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() " Muchun Song
2020-11-08 14:10 ` [PATCH v3 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
2020-11-09 13:52   ` Oscar Salvador
2020-11-09 14:20     ` [External] " Muchun Song
2020-11-10 19:31     ` Mike Kravetz
2020-11-10 19:50       ` Matthew Wilcox
2020-11-10 20:30         ` Mike Kravetz
2020-11-17 15:35         ` [External] " Muchun Song
2020-11-11  3:28       ` Muchun Song
2020-11-08 14:10 ` [PATCH v3 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
2020-11-09 16:48   ` Oscar Salvador
2020-11-10  2:42     ` [External] " Muchun Song
2020-11-10 19:38       ` Mike Kravetz
2020-11-11  3:22         ` Muchun Song
2020-11-08 14:10 ` [PATCH v3 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
2020-11-09 17:21   ` Oscar Salvador
2020-11-10  3:49     ` [External] " Muchun Song
2020-11-10  5:42       ` Oscar Salvador
2020-11-10  6:08         ` Muchun Song
2020-11-10  6:33           ` Oscar Salvador
2020-11-10  7:10             ` Muchun Song
2020-11-11  0:47   ` Mike Kravetz
2020-11-11  3:41     ` [External] " Muchun Song
2020-11-13  0:35       ` Mike Kravetz
2020-11-13  1:02         ` Mike Kravetz
2020-11-13  4:18         ` Muchun Song
2020-11-08 14:10 ` [PATCH v3 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page() Muchun Song
2020-11-08 14:10 ` [PATCH v3 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist Muchun Song
2020-11-08 14:11 ` [PATCH v3 08/21] mm/vmemmap: Initialize page table lock for vmemmap Muchun Song
2020-11-09 18:11   ` Oscar Salvador
2020-11-10  5:17     ` [External] " Muchun Song
2020-11-08 14:11 ` [PATCH v3 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
2020-11-09 18:51   ` Oscar Salvador
2020-11-10  6:40     ` [External] " Muchun Song
2020-11-10  9:48       ` Oscar Salvador
2020-11-10 10:47         ` Muchun Song
2020-11-10 13:52           ` Oscar Salvador
2020-11-10 14:00             ` Muchun Song
2020-11-08 14:11 ` [PATCH v3 10/21] mm/hugetlb: Defer freeing of hugetlb pages Muchun Song
2020-11-08 14:11 ` [PATCH v3 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
2020-11-08 14:11 ` [PATCH v3 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper Muchun Song
2020-11-08 14:11 ` [PATCH v3 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
2020-11-08 14:11 ` [PATCH v3 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page Muchun Song
2020-11-08 14:11 ` [PATCH v3 15/21] mm/hugetlb: Add a BUILD_BUG_ON to check if struct page size is a power of two Muchun Song
2020-11-08 14:11 ` [PATCH v3 16/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
2020-11-08 14:11 ` [PATCH v3 17/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
2020-11-08 14:11 ` [PATCH v3 18/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
2020-11-08 14:11 ` [PATCH v3 19/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
2020-11-08 14:11 ` [PATCH v3 20/21] mm/hugetlb: Gather discrete indexes of tail page Muchun Song
2020-11-08 14:11 ` [PATCH v3 21/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page Muchun Song
2020-11-10 19:23 ` [PATCH v3 00/21] Free some vmemmap pages of hugetlb page Mike Kravetz
2020-11-11  3:21   ` [External] " Muchun Song
