linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
@ 2020-11-20  6:43 Muchun Song
  2020-11-20  6:43 ` [PATCH v5 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
                   ` (21 more replies)
  0 siblings, 22 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Hi all,

This patch series frees some vmemmap pages (struct page structures)
associated with each HugeTLB page when it is preallocated, to save memory.

The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.

HugeTLB pages consist of multiple base pages and are supported by many
architectures. See hugetlbpage.rst in the Documentation directory for more
details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB
page consists of 512 base pages and a 1GB HugeTLB page consists of 262144
base pages. For each base page, there is a corresponding page struct.

Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
provides this upper limit. The only 'useful' information in the remaining
page structs is the compound_head field, and this field is the same for all
tail pages.

By removing redundant page structs for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

When the system boots up, every 2MB HugeTLB page has 512 struct pages, which
occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    2MB    |                     +-----------+                +-----------+
 |           |                     |     5     | -------------> |     5     |
 |           |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB page. The only use of the
remaining pages of page structs (page 1 to page 7) is to point to
page->compound_head. Therefore, we can remap pages 2 to 7 onto page 1, so
that only 2 pages of page structs are used for each HugeTLB page. This allows
us to free the remaining 6 pages to the buddy allocator.

Here is how things look after remapping.

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    2MB    |                     +-----------+                       | | |
 |           |                     |     5     | ----------------------+ | |
 |           |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we must allocate 6 pages
for the vmemmap and restore the previous mapping relationship.

Apart from 2MB HugeTLB pages, there are also 1GB HugeTLB pages. They are
similar to the 2MB case, and we can use the same approach to free their
vmemmap pages.

In this case, for a 1GB HugeTLB page, we can save 4088 pages (there are 4096
pages of struct page structs; we reserve 2 pages for the vmemmap and 8 pages
for page tables, so we can save 4088 pages). This is a very substantial gain.
On our servers, we run some SPDK/QEMU applications that use 1024GB of
hugetlb pages. With this feature enabled, we can save ~16GB (1GB hugepages)
or ~11GB (2MB hugepages; the worst case is 10GB and the best is 12GB) of
memory.

Because the vmemmap page tables are reconstructed on the freeing/allocating
path, some overhead is added. Here is an overhead analysis.

1) Allocating 10240 2MB hugetlb pages.

   a) With this patch series applied:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.166s
   user     0m0.000s
   sys      0m0.166s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [16K, 32K)          1868 |@@@@@@@@@@@                                         |
   [32K, 64K)            10 |                                                    |
   [64K, 128K)            2 |                                                    |

   b) Without this patch series:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.066s
   user     0m0.000s
   sys      0m0.066s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)             62 |                                                    |
   [16K, 32K)             2 |                                                    |

   Summary: this feature makes allocation about ~2x slower than before.

2) Freeing 10240 2MB hugetlb pages.

   a) With this patch series applied:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.004s
   user     0m0.000s
   sys      0m0.002s

   # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

   b) Without this patch series:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.077s
   user     0m0.001s
   sys      0m0.075s

   # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)            287 |@                                                   |
   [16K, 32K)             3 |                                                    |

   Summary: __free_hugepage() is about ~2-4x slower than before. But based
            on the allocation test above, I think freeing here is also about
            ~2x slower than before.

            Why is the 'real' time smaller with the patches applied? Because
            in this patch series, freeing hugetlb pages is asynchronous
            (done through a kworker).

Although the overhead has increased, the overhead is not significant. Like Mike
said, "However, remember that the majority of use cases create hugetlb pages at
or shortly after boot time and add them to the pool. So, additional overhead is
at pool creation time. There is no change to 'normal run time' operations of
getting a page from or returning a page to the pool (think page fault/unmap)".

Todo:
  1. Free all of the tail vmemmap pages
     For the 2MB HugeTLB page, we currently free only 6 vmemmap pages, but
     we could actually free 7. In that case, 8 of the 512 struct pages would
     have the PG_head flag set, so compound_head() would need a slight
     adjustment to return the real head struct page when its parameter is a
     tail struct page that has the PG_head flag set.

     To keep the code evolution route clear, this can be a separate patch
     series after this patchset is solid.

  Changelog in v5:
  1. Rework some comments and code in [PATCH v4 04/21] and [PATCH v4 05/21].
     Thanks to Mike and Oscar's suggestions.

  Changelog in v4:
  1. Move all the vmemmap functions to hugetlb_vmemmap.c.
  2. Make the CONFIG_HUGETLB_PAGE_FREE_VMEMMAP default to y, if we want to
     disable this feature, we should disable it by a boot/kernel command line.
  3. Remove vmemmap_pgtable_{init, deposit, withdraw}() helper functions.
  4. Initialize page table lock for vmemmap through core_initcall mechanism.

  Thanks for Mike and Oscar's suggestions.

  Changelog in v3:
  1. Rename some helper functions. Thanks Mike.
  2. Rework some code. Thanks Mike and Oscar.
  3. Remap the tail vmemmap page with PAGE_KERNEL_RO instead of
     PAGE_KERNEL. Thanks Matthew.
  4. Add some overhead analysis in the cover letter.
  5. Use the vmemmap pmd table lock instead of a hugetlb-specific global lock.

  Changelog in v2:
  1. Fix: do not call dissolve_compound_page() in alloc_huge_page_vmemmap().
  2. Fix some typos and code style problems.
  3. Remove the unused handle_vmemmap_fault().
  4. Merge some commits to one commit suggested by Mike.

Muchun Song (21):
  mm/memory_hotplug: Move bootmem info registration API to
    bootmem_info.c
  mm/memory_hotplug: Move {get,put}_page_bootmem() to bootmem_info.c
  mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  mm/hugetlb: Introduce pgtable allocation/freeing helpers
  mm/bootmem_info: Introduce {free,prepare}_vmemmap_page()
  mm/bootmem_info: Combine bootmem info and type into page->freelist
  mm/hugetlb: Initialize page table lock for vmemmap
  mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  mm/hugetlb: Defer freeing of hugetlb pages
  mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb
    page
  mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper
  mm/hugetlb: Use PG_slab to indicate split pmd
  mm/hugetlb: Support freeing vmemmap pages of gigantic page
  mm/hugetlb: Set the PageHWPoison to the raw error page
  mm/hugetlb: Flush work when dissolving hugetlb page
  mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  mm/hugetlb: Merge pte to huge pmd only for gigantic page
  mm/hugetlb: Gather discrete indexes of tail page
  mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct
    page
  mm/hugetlb: Disable freeing vmemmap if struct page size is not power
    of two

 Documentation/admin-guide/kernel-parameters.txt |   9 +
 Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
 arch/x86/include/asm/hugetlb.h                  |  17 +
 arch/x86/include/asm/pgtable_64_types.h         |   8 +
 arch/x86/mm/init_64.c                           |   7 +-
 fs/Kconfig                                      |  14 +
 include/linux/bootmem_info.h                    |  78 +++
 include/linux/hugetlb.h                         |  19 +
 include/linux/hugetlb_cgroup.h                  |  15 +-
 include/linux/memory_hotplug.h                  |  27 -
 mm/Makefile                                     |   2 +
 mm/bootmem_info.c                               | 124 ++++
 mm/hugetlb.c                                    | 163 ++++-
 mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h                            | 103 ++++
 mm/memory_hotplug.c                             | 116 ----
 mm/sparse.c                                     |   5 +-
 17 files changed, 1295 insertions(+), 180 deletions(-)
 create mode 100644 include/linux/bootmem_info.h
 create mode 100644 mm/bootmem_info.c
 create mode 100644 mm/hugetlb_vmemmap.c
 create mode 100644 mm/hugetlb_vmemmap.h

-- 
2.11.0



* [PATCH v5 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() " Muchun Song
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Move the common bootmem info registration API to a separate file,
bootmem_info.c, for use by later patches. This is just code movement
without any functional change.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/mm/init_64.c          |  1 +
 include/linux/bootmem_info.h   | 27 ++++++++++++
 include/linux/memory_hotplug.h | 23 ----------
 mm/Makefile                    |  1 +
 mm/bootmem_info.c              | 99 ++++++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            | 91 +-------------------------------------
 6 files changed, 129 insertions(+), 113 deletions(-)
 create mode 100644 include/linux/bootmem_info.h
 create mode 100644 mm/bootmem_info.c

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..c7f7ad55b625 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/bootmem_info.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
new file mode 100644
index 000000000000..65bb9b23140f
--- /dev/null
+++ b/include/linux/bootmem_info.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_BOOTMEM_INFO_H
+#define __LINUX_BOOTMEM_INFO_H
+
+#include <linux/mmzone.h>
+
+/*
+ * Types for free bootmem stored in page->lru.next. These have to be in
+ * some random range in unsigned long space for debugging purposes.
+ */
+enum {
+	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
+	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
+	MIX_SECTION_INFO,
+	NODE_INFO,
+	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
+};
+
+#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
+void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
+#else
+static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+}
+#endif
+
+#endif /* __LINUX_BOOTMEM_INFO_H */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 51a877fec8da..19e5d067294c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,18 +33,6 @@ struct vmem_altmap;
 	___page;						   \
 })
 
-/*
- * Types for free bootmem stored in page->lru.next. These have to be in
- * some random range in unsigned long space for debugging purposes.
- */
-enum {
-	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
-	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
-	MIX_SECTION_INFO,
-	NODE_INFO,
-	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
-};
-
 /* Types for control the zone type of onlined and offlined memory */
 enum {
 	/* Offline the memory. */
@@ -209,13 +197,6 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
-extern void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
-#else
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-#endif
 extern void put_page_bootmem(struct page *page);
 extern void get_page_bootmem(unsigned long ingo, struct page *page,
 			     unsigned long type);
@@ -254,10 +235,6 @@ static inline int mhp_notimplemented(const char *func)
 	return -ENOSYS;
 }
 
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-
 static inline int try_online_node(int nid)
 {
 	return 0;
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..752111587c99 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)	+= kasan/
 obj-$(CONFIG_FAILSLAB) += failslab.o
+obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
new file mode 100644
index 000000000000..39fa8fc120bc
--- /dev/null
+++ b/mm/bootmem_info.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  linux/mm/bootmem_info.c
+ *
+ *  Copyright (C)
+ */
+#include <linux/mm.h>
+#include <linux/compiler.h>
+#include <linux/memblock.h>
+#include <linux/bootmem_info.h>
+#include <linux/memory_hotplug.h>
+
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+	struct mem_section_usage *usage;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	/* Get section's memmap address */
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	/*
+	 * Get page for the memmap's phys address
+	 * XXX: need more consideration for sparse_vmemmap...
+	 */
+	page = virt_to_page(memmap);
+	mapsize = sizeof(struct page) * PAGES_PER_SECTION;
+	mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
+
+	/* remember memmap's page */
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, SECTION_INFO);
+
+	usage = ms->usage;
+	page = virt_to_page(usage);
+
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+
+}
+#else /* CONFIG_SPARSEMEM_VMEMMAP */
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+	struct mem_section_usage *usage;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+	usage = ms->usage;
+	page = virt_to_page(usage);
+
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
+
+void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+	unsigned long i, pfn, end_pfn, nr_pages;
+	int node = pgdat->node_id;
+	struct page *page;
+
+	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	page = virt_to_page(pgdat);
+
+	for (i = 0; i < nr_pages; i++, page++)
+		get_page_bootmem(node, page, NODE_INFO);
+
+	pfn = pgdat->node_start_pfn;
+	end_pfn = pgdat_end_pfn(pgdat);
+
+	/* register section info */
+	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		/*
+		 * Some platforms can assign the same pfn to multiple nodes - on
+		 * node0 as well as nodeN.  To avoid registering a pfn against
+		 * multiple nodes we check that this pfn does not already
+		 * reside in some other nodes.
+		 */
+		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
+			register_page_bootmem_info_section(pfn);
+	}
+}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index baded53b9ff9..2da4ad071456 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -21,6 +21,7 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
+#include <linux/bootmem_info.h>
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
@@ -167,96 +168,6 @@ void put_page_bootmem(struct page *page)
 	}
 }
 
-#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void register_page_bootmem_info_section(unsigned long start_pfn)
-{
-	unsigned long mapsize, section_nr, i;
-	struct mem_section *ms;
-	struct page *page, *memmap;
-	struct mem_section_usage *usage;
-
-	section_nr = pfn_to_section_nr(start_pfn);
-	ms = __nr_to_section(section_nr);
-
-	/* Get section's memmap address */
-	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-
-	/*
-	 * Get page for the memmap's phys address
-	 * XXX: need more consideration for sparse_vmemmap...
-	 */
-	page = virt_to_page(memmap);
-	mapsize = sizeof(struct page) * PAGES_PER_SECTION;
-	mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
-
-	/* remember memmap's page */
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, SECTION_INFO);
-
-	usage = ms->usage;
-	page = virt_to_page(usage);
-
-	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
-
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
-
-}
-#else /* CONFIG_SPARSEMEM_VMEMMAP */
-static void register_page_bootmem_info_section(unsigned long start_pfn)
-{
-	unsigned long mapsize, section_nr, i;
-	struct mem_section *ms;
-	struct page *page, *memmap;
-	struct mem_section_usage *usage;
-
-	section_nr = pfn_to_section_nr(start_pfn);
-	ms = __nr_to_section(section_nr);
-
-	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-
-	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
-
-	usage = ms->usage;
-	page = virt_to_page(usage);
-
-	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
-
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
-}
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
-
-void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-	unsigned long i, pfn, end_pfn, nr_pages;
-	int node = pgdat->node_id;
-	struct page *page;
-
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
-	page = virt_to_page(pgdat);
-
-	for (i = 0; i < nr_pages; i++, page++)
-		get_page_bootmem(node, page, NODE_INFO);
-
-	pfn = pgdat->node_start_pfn;
-	end_pfn = pgdat_end_pfn(pgdat);
-
-	/* register section info */
-	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		/*
-		 * Some platforms can assign the same pfn to multiple nodes - on
-		 * node0 as well as nodeN.  To avoid registering a pfn against
-		 * multiple nodes we check that this pfn does not already
-		 * reside in some other nodes.
-		 */
-		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
-			register_page_bootmem_info_section(pfn);
-	}
-}
-#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
-
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
-- 
2.11.0



* [PATCH v5 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() to bootmem_info.c
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
  2020-11-20  6:43 ` [PATCH v5 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

A later patch will use {get,put}_page_bootmem() to initialize vmemmap pages
or to free vmemmap pages to the buddy allocator, so move them out of
CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any
functional change.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/mm/init_64.c          |  2 +-
 include/linux/bootmem_info.h   | 13 +++++++++++++
 include/linux/memory_hotplug.h |  4 ----
 mm/bootmem_info.c              | 25 +++++++++++++++++++++++++
 mm/memory_hotplug.c            | 27 ---------------------------
 mm/sparse.c                    |  1 +
 6 files changed, 40 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index c7f7ad55b625..0a45f062826e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1572,7 +1572,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return err;
 }
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HAVE_BOOTMEM_INFO_NODE)
+#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long nr_pages)
 {
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 65bb9b23140f..4ed6dee1adc9 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -18,10 +18,23 @@ enum {
 
 #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
+
+void get_page_bootmem(unsigned long info, struct page *page,
+		      unsigned long type);
+void put_page_bootmem(struct page *page);
 #else
 static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
 }
+
+static inline void put_page_bootmem(struct page *page)
+{
+}
+
+static inline void get_page_bootmem(unsigned long info, struct page *page,
+				    unsigned long type)
+{
+}
 #endif
 
 #endif /* __LINUX_BOOTMEM_INFO_H */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 19e5d067294c..c9f3361fe84b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -197,10 +197,6 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-extern void put_page_bootmem(struct page *page);
-extern void get_page_bootmem(unsigned long ingo, struct page *page,
-			     unsigned long type);
-
 void get_online_mems(void);
 void put_online_mems(void);
 
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index 39fa8fc120bc..fcab5a3f8cc0 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -10,6 +10,31 @@
 #include <linux/bootmem_info.h>
 #include <linux/memory_hotplug.h>
 
+void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
+{
+	page->freelist = (void *)type;
+	SetPagePrivate(page);
+	set_page_private(page, info);
+	page_ref_inc(page);
+}
+
+void put_page_bootmem(struct page *page)
+{
+	unsigned long type;
+
+	type = (unsigned long) page->freelist;
+	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
+
+	if (page_ref_dec_return(page) == 1) {
+		page->freelist = NULL;
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		INIT_LIST_HEAD(&page->lru);
+		free_reserved_page(page);
+	}
+}
+
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2da4ad071456..ae57eedc341f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -21,7 +21,6 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
-#include <linux/bootmem_info.h>
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
@@ -142,32 +141,6 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-void get_page_bootmem(unsigned long info,  struct page *page,
-		      unsigned long type)
-{
-	page->freelist = (void *)type;
-	SetPagePrivate(page);
-	set_page_private(page, info);
-	page_ref_inc(page);
-}
-
-void put_page_bootmem(struct page *page)
-{
-	unsigned long type;
-
-	type = (unsigned long) page->freelist;
-	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
-	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
-
-	if (page_ref_dec_return(page) == 1) {
-		page->freelist = NULL;
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		INIT_LIST_HEAD(&page->lru);
-		free_reserved_page(page);
-	}
-}
-
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
diff --git a/mm/sparse.c b/mm/sparse.c
index b25ad8e64839..a4138410d890 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/bootmem_info.h>
 
 #include "internal.h"
 #include <asm/dma.h>
-- 
2.11.0



* [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
  2020-11-20  6:43 ` [PATCH v5 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
  2020-11-20  6:43 ` [PATCH v5 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() " Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  7:49   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
whether to enable the feature of freeing unused vmemmap pages associated
with HugeTLB pages. For now, only x86 is supported.
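As a rough illustration of where the numbers in the Kconfig help text come
from, here is a standalone sketch (assuming x86's 4KB base pages, a 64-byte
struct page, and the two vmemmap pages the series keeps mapped; the helper
names vmemmap_pages() and free_vmemmap_pages() are illustrative, not kernel
API):

```c
#include <assert.h>

#define BASE_PAGE_SIZE		4096UL	/* x86 base page size */
#define STRUCT_PAGE_SIZE	64UL	/* typical sizeof(struct page) */
#define RESERVE_VMEMMAP_NR	2UL	/* head page + first tail page stay mapped */

/* Number of vmemmap page frames backing the struct pages of one HugeTLB page. */
static unsigned long vmemmap_pages(unsigned long hpage_size)
{
	return hpage_size / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Vmemmap page frames that can be handed back to the buddy allocator. */
static unsigned long free_vmemmap_pages(unsigned long hpage_size)
{
	unsigned long total = vmemmap_pages(hpage_size);

	return total > RESERVE_VMEMMAP_NR ? total - RESERVE_VMEMMAP_NR : 0;
}
```

For a 2MB HugeTLB page this yields 8 vmemmap pages of which 6 can be freed;
for a 1GB page, 4096 of which 4094 can be freed, matching the help text.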

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/mm/init_64.c |  2 +-
 fs/Kconfig            | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0a45f062826e..0435bee2e172 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
 
 static void __init register_page_bootmem_info(void)
 {
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
 	int i;
 
 	for_each_online_node(i)
diff --git a/fs/Kconfig b/fs/Kconfig
index 976e8b9033c4..4961dd488444 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -245,6 +245,20 @@ config HUGETLBFS
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
 
+config HUGETLB_PAGE_FREE_VMEMMAP
+	def_bool HUGETLB_PAGE
+	depends on X86
+	depends on SPARSEMEM_VMEMMAP
+	depends on HAVE_BOOTMEM_INFO_NODE
+	help
+	  When using HUGETLB_PAGE_FREE_VMEMMAP, the system can save some
+	  memory from pre-allocated HugeTLB pages when they are not used:
+	  6 pages per 2MB HugeTLB page and 4094 pages per 1GB HugeTLB page.
+
+	  When the pages are going to be used or freed up, the vmemmap array
+	  representing that range needs to be remapped again and the pages
+	  we discarded earlier need to be reallocated.
+
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
 
-- 
2.11.0



* [PATCH v5 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (2 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Every HugeTLB page has more than one struct page structure. A 2MB HugeTLB
page has 512 struct page structures and a 1GB HugeTLB page has 262144. We
__know__ that we only use the first 4 (HUGETLB_CGROUP_MIN_ORDER) struct
page structures to store metadata associated with each HugeTLB page.

There are a lot of struct page structures (8 page frames for a 2MB HugeTLB
page and 4096 page frames for a 1GB HugeTLB page) associated with each
HugeTLB page. For tail pages, the value of compound_head is the same, so
we can reuse the first page of the tail page structures. We map the virtual
addresses of the remaining pages of tail page structures to the first tail
page struct, and then free those page frames. Therefore, we need to reserve
only two pages as vmemmap areas.

So we introduce a new nr_free_vmemmap_pages field in struct hstate to
indicate how many vmemmap pages associated with a HugeTLB page can be
freed to the buddy system.
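The aliasing described above can be modeled in miniature: once the virtual
addresses of the later vmemmap frames alias the physical frame of the first
tail page, every tail struct page reads the same compound_head. A toy
sketch (the frame/aliasing model and all names are illustrative, not kernel
code):

```c
#include <assert.h>

/* Toy model: 8 vmemmap frames, each holding several tail struct pages. */
#define NR_VMEMMAP_FRAMES	8
#define PAGES_PER_FRAME		64

struct toy_page {
	unsigned long compound_head;
};

static struct toy_page frames[NR_VMEMMAP_FRAMES][PAGES_PER_FRAME];

/* After remapping, virtual frames 2..7 all alias physical frame 1. */
static struct toy_page *vmemmap_frame(int i)
{
	return frames[i < 2 ? i : 1];
}

/* Writing compound_head into frame 1 makes every aliased tail page see it. */
static int compound_head_is_shared(void)
{
	int i, j;

	frames[1][0].compound_head = 0x1234;
	for (i = 2; i < NR_VMEMMAP_FRAMES; i++)
		for (j = 0; j < PAGES_PER_FRAME; j++)
			if (vmemmap_frame(i)[j].compound_head !=
			    frames[1][j].compound_head)
				return 0;
	return 1;
}
```

Because frames 2..7 are never read for anything except compound_head, the
physical frames behind them carry no unique information and can be freed.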

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |   3 ++
 mm/Makefile             |   1 +
 mm/hugetlb.c            |   3 ++
 mm/hugetlb_vmemmap.c    | 134 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h    |  20 ++++++++
 5 files changed, 161 insertions(+)
 create mode 100644 mm/hugetlb_vmemmap.c
 create mode 100644 mm/hugetlb_vmemmap.h

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5cc5f802dd4..eed3dd3bd626 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -492,6 +492,9 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	unsigned int nr_free_vmemmap_pages;
+#endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
 	struct cftype cgroup_files_dfl[7];
diff --git a/mm/Makefile b/mm/Makefile
index 752111587c99..2a734576bbc0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
+obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)	+= hugetlb_vmemmap.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 81a41aa080a5..f88032c24667 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -42,6 +42,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/page_owner.h>
 #include "internal.h"
+#include "hugetlb_vmemmap.h"
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
@@ -3285,6 +3286,8 @@ void __init hugetlb_add_hstate(unsigned int order)
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
 
+	hugetlb_vmemmap_init(h);
+
 	parsed_hstate = h;
 }
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
new file mode 100644
index 000000000000..1afe245395e5
--- /dev/null
+++ b/mm/hugetlb_vmemmap.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free some vmemmap pages of HugeTLB
+ *
+ * Copyright (c) 2020, Bytedance. All rights reserved.
+ *
+ *     Author: Muchun Song <songmuchun@bytedance.com>
+ *
+ * The struct page structures (page structs) are used to describe a physical
+ * page frame. By default, there is a one-to-one mapping from a page frame to
+ * it's corresponding page struct.
+ *
+ * HugeTLB pages consist of multiple base page size pages and are supported by
+ * many architectures. See hugetlbpage.rst in the Documentation directory for
+ * more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are
+ * currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB
+ * page consists of 512 base pages and a 1GB HugeTLB page consists of 262144
+ * base pages. For each base page, there is a corresponding page struct.
+ *
+ * Within the HugeTLB subsystem, only the first 4 page structs are used to
+ * contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
+ * provides this upper limit. The only 'useful' information in the remaining
+ * page structs is the compound_head field, and this field is the same for all
+ * tail pages.
+ *
+ * By removing redundant page structs for HugeTLB pages, memory can be returned
+ * to the buddy allocator for other uses.
+ *
+ * When the system boots up, every 2MB HugeTLB page has 512 struct page structs
+ * which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
+ *
+ *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ * |           |                     |     0     | -------------> |     0     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     1     | -------------> |     1     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     2     | -------------> |     2     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     3     | -------------> |     3     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     4     | -------------> |     4     |
+ * |    2MB    |                     +-----------+                +-----------+
+ * |           |                     |     5     | -------------> |     5     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     6     | -------------> |     6     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     7     | -------------> |     7     |
+ * |           |                     +-----------+                +-----------+
+ * |           |
+ * |           |
+ * |           |
+ * +-----------+
+ *
+ * The value of page->compound_head is the same for all tail pages. The first
+ * page of page structs (page 0) associated with the HugeTLB page contains the 4
+ * page structs necessary to describe the HugeTLB. The only use of the remaining
+ * pages of page structs (page 1 to page 7) is to point to page->compound_head.
+ * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+ * will be used for each HugeTLB page. This will allow us to free the remaining
+ * 6 pages to the buddy allocator.
+ *
+ * Here is how things look after remapping.
+ *
+ *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ * |           |                     |     0     | -------------> |     0     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     1     | -------------> |     1     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ * |           |                     +-----------+                   | | | | |
+ * |           |                     |     3     | ------------------+ | | | |
+ * |           |                     +-----------+                     | | | |
+ * |           |                     |     4     | --------------------+ | | |
+ * |    2MB    |                     +-----------+                       | | |
+ * |           |                     |     5     | ----------------------+ | |
+ * |           |                     +-----------+                         | |
+ * |           |                     |     6     | ------------------------+ |
+ * |           |                     +-----------+                           |
+ * |           |                     |     7     | --------------------------+
+ * |           |                     +-----------+
+ * |           |
+ * |           |
+ * |           |
+ * +-----------+
+ *
+ * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
+ * vmemmap pages and restore the previous mapping relationship.
+ *
+ * Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It is
+ * similar to the 2MB HugeTLB page, and we can use the same approach to free
+ * its vmemmap pages.
+ *
+ * In this case, for the 1GB HugeTLB page, we can save 4086 pages (there are
+ * 4096 pages of struct page structs; we reserve 2 pages for vmemmap and
+ * allocate 8 pages for page tables, so the net saving is 4086 pages). This
+ * is a very substantial gain.
+ */
+#define pr_fmt(fmt)	"HugeTLB Vmemmap: " fmt
+
+#include "hugetlb_vmemmap.h"
+
+/*
+ * There are a lot of struct page structures (8 page frames for a 2MB HugeTLB
+ * page and 4096 page frames for a 1GB HugeTLB page) associated with each
+ * HugeTLB page. For tail pages, the value of compound_head is the same, so we
+ * can reuse the first page of the tail page structures: we map the virtual
+ * addresses of the remaining tail-page structs to the first tail page struct
+ * and then free those page frames. So we need to reserve two vmemmap pages.
+ */
+#define RESERVE_VMEMMAP_NR		2U
+
+void __init hugetlb_vmemmap_init(struct hstate *h)
+{
+	unsigned int order = huge_page_order(h);
+	unsigned int vmemmap_pages;
+
+	vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
+	/*
+	 * The head page and the first tail page are not to be freed to the
+	 * buddy system; the other pages will be remapped to the first tail
+	 * page, so only the remaining pages can be freed.
+	 *
+	 * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? This is
+	 * not expected to happen unless the system is corrupted, so the
+	 * check below is only a safety net.
+	 */
+	if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR))
+		h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
+
+	pr_debug("can free %d vmemmap pages for %s\n", h->nr_free_vmemmap_pages,
+		 h->name);
+}
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
new file mode 100644
index 000000000000..40c0c7dfb60d
--- /dev/null
+++ b/mm/hugetlb_vmemmap.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Free some vmemmap pages of HugeTLB
+ *
+ * Copyright (c) 2020, Bytedance. All rights reserved.
+ *
+ *     Author: Muchun Song <songmuchun@bytedance.com>
+ */
+#ifndef _LINUX_HUGETLB_VMEMMAP_H
+#define _LINUX_HUGETLB_VMEMMAP_H
+#include <linux/hugetlb.h>
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+void __init hugetlb_vmemmap_init(struct hstate *h);
+#else
+static inline void hugetlb_vmemmap_init(struct hstate *h)
+{
+}
+#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
+#endif /* _LINUX_HUGETLB_VMEMMAP_H */
-- 
2.11.0



* [PATCH v5 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (3 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page() Muchun Song
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

On x86_64, the vmemmap is always PMD mapped if the machine has hugepage
support and we have 2MB of contiguous, PMD-aligned pages. If we want to
free the unused vmemmap pages, we have to split the huge PMD first. So we
pre-allocate the page tables needed to split the PMD into PTEs.
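As a back-of-the-envelope check of how many page tables that is, here is a
standalone sketch of the arithmetic (assuming 4KB base pages and 2MB PMDs;
pgtable_pages_to_prealloc() is an illustrative helper, not the kernel
function):

```c
#include <assert.h>

#define PAGE_SHIFT		12
#define VMEMMAP_HPAGE_SHIFT	21	/* one PMD maps 2MB of vmemmap */
#define VMEMMAP_HPAGE_SIZE	(1UL << VMEMMAP_HPAGE_SHIFT)
#define RESERVE_VMEMMAP_NR	2UL
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))

/*
 * Page tables to preallocate so that every PMD covering a HugeTLB page's
 * vmemmap can be split into PTEs.
 */
static unsigned int pgtable_pages_to_prealloc(unsigned long nr_free_vmemmap)
{
	unsigned long vmemmap_size;

	if (!nr_free_vmemmap)	/* nothing to free, nothing to split */
		return 0;

	vmemmap_size = (nr_free_vmemmap + RESERVE_VMEMMAP_NR) << PAGE_SHIFT;
	return ALIGN_UP(vmemmap_size, VMEMMAP_HPAGE_SIZE) >> VMEMMAP_HPAGE_SHIFT;
}
```

A 2MB HugeTLB page (32KB of vmemmap) needs one preallocated page table; a
1GB HugeTLB page (16MB of vmemmap, 8 PMDs) needs eight.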

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_vmemmap.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h | 11 ++++++++
 2 files changed, 87 insertions(+)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 1afe245395e5..ec70980000d8 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -99,6 +99,8 @@
  */
 #define pr_fmt(fmt)	"HugeTLB Vmemmap: " fmt
 
+#include <linux/list.h>
+#include <asm/pgalloc.h>
 #include "hugetlb_vmemmap.h"
 
 /*
@@ -111,6 +113,80 @@
  */
 #define RESERVE_VMEMMAP_NR		2U
 
+#ifndef VMEMMAP_HPAGE_SHIFT
+#define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
+#endif
+#define VMEMMAP_HPAGE_ORDER		(VMEMMAP_HPAGE_SHIFT - PAGE_SHIFT)
+#define VMEMMAP_HPAGE_NR		(1 << VMEMMAP_HPAGE_ORDER)
+#define VMEMMAP_HPAGE_SIZE		((1UL) << VMEMMAP_HPAGE_SHIFT)
+#define VMEMMAP_HPAGE_MASK		(~(VMEMMAP_HPAGE_SIZE - 1))
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return h->nr_free_vmemmap_pages;
+}
+
+static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
+}
+
+static inline unsigned long vmemmap_pages_size_per_hpage(struct hstate *h)
+{
+	return (unsigned long)vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
+}
+
+static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
+{
+	unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
+
+	/*
+	 * No need to pre-allocate page tables when there are no vmemmap
+	 * pages to free.
+	 */
+	if (!free_vmemmap_pages_per_hpage(h))
+		return 0;
+
+	return ALIGN(vmemmap_size, VMEMMAP_HPAGE_SIZE) >> VMEMMAP_HPAGE_SHIFT;
+}
+
+void vmemmap_pgtable_free(struct page *page)
+{
+	struct page *pte_page, *t_page;
+
+	list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
+		list_del(&pte_page->lru);
+		pte_free_kernel(&init_mm, page_to_virt(pte_page));
+	}
+}
+
+int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
+{
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	/*
+	 * Use the huge page lru list to temporarily store the preallocated
+	 * pages. The preallocated pages are used and the list is emptied
+	 * before the huge page is put into use. When the huge page is put
+	 * into use by prep_new_huge_page() the list will be reinitialized.
+	 */
+	INIT_LIST_HEAD(&page->lru);
+
+	while (nr--) {
+		pte_t *pte_p;
+
+		pte_p = pte_alloc_one_kernel(&init_mm);
+		if (!pte_p)
+			goto out;
+		list_add(&virt_to_page(pte_p)->lru, &page->lru);
+	}
+
+	return 0;
+out:
+	vmemmap_pgtable_free(page);
+	return -ENOMEM;
+}
+
 void __init hugetlb_vmemmap_init(struct hstate *h)
 {
 	unsigned int order = huge_page_order(h);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 40c0c7dfb60d..9eca6879c0a4 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,9 +12,20 @@
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 void __init hugetlb_vmemmap_init(struct hstate *h);
+int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
+void vmemmap_pgtable_free(struct page *page);
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
 }
+
+static inline int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
+{
+	return 0;
+}
+
+static inline void vmemmap_pgtable_free(struct page *page)
+{
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
-- 
2.11.0



* [PATCH v5 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page()
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (4 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist Muchun Song
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In a later patch, we will use free_vmemmap_page() to free the unused
vmemmap pages and prepare_vmemmap_page() to initialize a page for use
as a vmemmap page.
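The pairing of the two helpers can be modeled as a small refcount state
machine (a toy sketch of the intended lifecycle, not kernel code; the
struct fields and helper names are illustrative):

```c
#include <assert.h>

/* Toy model of a vmemmap page's bootmem lifecycle. */
struct toy_page {
	int ref;	/* models page_ref_count() */
	int reserved;	/* models PageReserved() */
	int freed;	/* handed back to the buddy allocator */
};

/* prepare_vmemmap_page(): take a bootmem reference and mark reserved. */
static void prepare(struct toy_page *p)
{
	p->ref++;
	p->reserved = 1;
}

/*
 * free_vmemmap_page(): drop the bootmem reference; the page is really
 * freed once the refcount falls back to 1 (put_page_bootmem() logic).
 */
static void release(struct toy_page *p)
{
	if (p->reserved && --p->ref == 1) {
		p->reserved = 0;
		p->freed = 1;
	}
}

static int lifecycle_ok(void)
{
	struct toy_page p = { .ref = 1 };

	prepare(&p);
	if (p.ref != 2 || !p.reserved)
		return 0;
	release(&p);
	return p.freed && p.ref == 1 && !p.reserved;
}
```

The point of the model: prepare() and release() must come in matched
pairs, and a page is only returned to the buddy allocator after the last
bootmem reference is dropped.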

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/bootmem_info.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 4ed6dee1adc9..239e3cc8f86c 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -3,6 +3,7 @@
 #define __LINUX_BOOTMEM_INFO_H
 
 #include <linux/mmzone.h>
+#include <linux/mm.h>
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
@@ -22,6 +23,29 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
 void get_page_bootmem(unsigned long info, struct page *page,
 		      unsigned long type);
 void put_page_bootmem(struct page *page);
+
+static inline void free_vmemmap_page(struct page *page)
+{
+	VM_WARN_ON(!PageReserved(page) || page_ref_count(page) != 2);
+
+	/* bootmem page has reserved flag in the reserve_bootmem_region */
+	if (PageReserved(page)) {
+		unsigned long magic = (unsigned long)page->freelist;
+
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
+			put_page_bootmem(page);
+		else
+			WARN_ON(1);
+	}
+}
+
+static inline void prepare_vmemmap_page(struct page *page)
+{
+	unsigned long section_nr = pfn_to_section_nr(page_to_pfn(page));
+
+	get_page_bootmem(section_nr, page, SECTION_INFO);
+	mark_page_reserved(page);
+}
 #else
 static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
-- 
2.11.0



* [PATCH v5 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (5 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page() Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 08/21] mm/hugetlb: Initialize page table lock for vmemmap Muchun Song
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

The page->private field shares storage with page->ptl. In a later patch,
we will use page->ptl, so here we combine the bootmem info and type into
page->freelist so that we no longer need page->private.
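A minimal standalone sketch of the packing scheme (assuming the type
values fit in two bits, as ilog2(MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE) + 1
suggests; the accessor names mirror the patch but this is an illustrative
model, not the kernel code):

```c
#include <assert.h>

#define BOOTMEM_TYPE_BITS	2	/* ilog2(max type) + 1 for this sketch */
#define BOOTMEM_TYPE_MAX	((1UL << BOOTMEM_TYPE_BITS) - 1)

/* Pack info and type into one pointer-sized word, as page->freelist does. */
static void *pack_bootmem(unsigned long info, unsigned long type)
{
	return (void *)((info << BOOTMEM_TYPE_BITS) | type);
}

static unsigned long page_bootmem_type(void *freelist)
{
	return (unsigned long)freelist & BOOTMEM_TYPE_MAX;
}

static unsigned long page_bootmem_info(void *freelist)
{
	return (unsigned long)freelist >> BOOTMEM_TYPE_BITS;
}
```

Both fields round-trip losslessly as long as info stays below
ULONG_MAX >> BOOTMEM_TYPE_BITS, which is what the BUG_ON checks in
get_page_bootmem() enforce.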

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/mm/init_64.c        |  2 +-
 include/linux/bootmem_info.h | 18 ++++++++++++++++--
 mm/bootmem_info.c            | 12 ++++++------
 mm/sparse.c                  |  4 ++--
 4 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0435bee2e172..9b738c6cb659 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -883,7 +883,7 @@ static void __meminit free_pagetable(struct page *page, int order)
 	if (PageReserved(page)) {
 		__ClearPageReserved(page);
 
-		magic = (unsigned long)page->freelist;
+		magic = page_bootmem_type(page);
 		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
 			while (nr_pages--)
 				put_page_bootmem(page++);
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 239e3cc8f86c..95ae80838680 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -6,7 +6,7 @@
 #include <linux/mm.h>
 
 /*
- * Types for free bootmem stored in page->lru.next. These have to be in
+ * Types for free bootmem stored in page->freelist. These have to be in
  * some random range in unsigned long space for debugging purposes.
  */
 enum {
@@ -17,6 +17,20 @@ enum {
 	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
 };
 
+#define BOOTMEM_TYPE_BITS	(ilog2(MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE) + 1)
+#define BOOTMEM_TYPE_MAX	((1UL << BOOTMEM_TYPE_BITS) - 1)
+#define BOOTMEM_INFO_MAX	(ULONG_MAX >> BOOTMEM_TYPE_BITS)
+
+static inline unsigned long page_bootmem_type(struct page *page)
+{
+	return (unsigned long)page->freelist & BOOTMEM_TYPE_MAX;
+}
+
+static inline unsigned long page_bootmem_info(struct page *page)
+{
+	return (unsigned long)page->freelist >> BOOTMEM_TYPE_BITS;
+}
+
 #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
 
@@ -30,7 +44,7 @@ static inline void free_vmemmap_page(struct page *page)
 
 	/* bootmem page has reserved flag in the reserve_bootmem_region */
 	if (PageReserved(page)) {
-		unsigned long magic = (unsigned long)page->freelist;
+		unsigned long magic = page_bootmem_type(page);
 
 		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
 			put_page_bootmem(page);
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index fcab5a3f8cc0..9baf163965fd 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -12,9 +12,9 @@
 
 void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
 {
-	page->freelist = (void *)type;
-	SetPagePrivate(page);
-	set_page_private(page, info);
+	BUG_ON(info > BOOTMEM_INFO_MAX);
+	BUG_ON(type > BOOTMEM_TYPE_MAX);
+	page->freelist = (void *)((info << BOOTMEM_TYPE_BITS) | type);
 	page_ref_inc(page);
 }
 
@@ -22,14 +22,12 @@ void put_page_bootmem(struct page *page)
 {
 	unsigned long type;
 
-	type = (unsigned long) page->freelist;
+	type = page_bootmem_type(page);
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
 	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
 
 	if (page_ref_dec_return(page) == 1) {
 		page->freelist = NULL;
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
 		free_reserved_page(page);
 	}
@@ -101,6 +99,8 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 	int node = pgdat->node_id;
 	struct page *page;
 
+	BUILD_BUG_ON(MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE > BOOTMEM_TYPE_MAX);
+
 	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
 	page = virt_to_page(pgdat);
 
diff --git a/mm/sparse.c b/mm/sparse.c
index a4138410d890..fca5fa38c2bc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -740,12 +740,12 @@ static void free_map_bootmem(struct page *memmap)
 		>> PAGE_SHIFT;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		magic = (unsigned long) page->freelist;
+		magic = page_bootmem_type(page);
 
 		BUG_ON(magic == NODE_INFO);
 
 		maps_section_nr = pfn_to_section_nr(page_to_pfn(page));
-		removing_section_nr = page_private(page);
+		removing_section_nr = page_bootmem_info(page);
 
 		/*
 		 * When this function is called, the removing section is
-- 
2.11.0



* [PATCH v5 08/21] mm/hugetlb: Initialize page table lock for vmemmap
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (6 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In a later patch, we will use the vmemmap page table lock to guard the
splitting of the vmemmap PMD, so initialize the vmemmap page table lock
here.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ec70980000d8..bc8546df4a51 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -99,6 +99,8 @@
  */
 #define pr_fmt(fmt)	"HugeTLB Vmemmap: " fmt
 
+#include <linux/pagewalk.h>
+#include <linux/mmzone.h>
 #include <linux/list.h>
 #include <asm/pgalloc.h>
 #include "hugetlb_vmemmap.h"
@@ -208,3 +210,70 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 	pr_debug("can free %d vmemmap pages for %s\n", h->nr_free_vmemmap_pages,
 		 h->name);
 }
+
+static int __init vmemmap_pud_entry(pud_t *pud, unsigned long addr,
+				    unsigned long next, struct mm_walk *walk)
+{
+	struct page *page = pud_page(*pud);
+
+	/*
+	 * The page->private shares storage with page->ptl. So make sure
+	 * that the PG_private is not set and initialize page->private to
+	 * zero.
+	 */
+	VM_BUG_ON_PAGE(PagePrivate(page), page);
+	set_page_private(page, 0);
+
+	BUG_ON(!pmd_ptlock_init(page));
+
+	return 0;
+}
+
+static void __init vmemmap_ptlock_init_section(unsigned long start_pfn)
+{
+	unsigned long section_nr;
+	struct mem_section *ms;
+	struct page *memmap, *memmap_end;
+	struct mm_struct *mm = &init_mm;
+
+	const struct mm_walk_ops ops = {
+		.pud_entry	= vmemmap_pud_entry,
+	};
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+	memmap_end = memmap + PAGES_PER_SECTION;
+
+	mmap_read_lock(mm);
+	BUG_ON(walk_page_range_novma(mm, (unsigned long)memmap,
+				     (unsigned long)memmap_end,
+				     &ops, NULL, NULL));
+	mmap_read_unlock(mm);
+}
+
+static void __init vmemmap_ptlock_init_node(int nid)
+{
+	unsigned long pfn, end_pfn;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+	pfn = pgdat->node_start_pfn;
+	end_pfn = pgdat_end_pfn(pgdat);
+
+	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION)
+		vmemmap_ptlock_init_section(pfn);
+}
+
+static int __init vmemmap_ptlock_init(void)
+{
+	int nid;
+
+	if (!hugepages_supported())
+		return 0;
+
+	for_each_online_node(nid)
+		vmemmap_ptlock_init_node(nid);
+
+	return 0;
+}
+core_initcall(vmemmap_ptlock_init);
-- 
2.11.0



* [PATCH v5 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (7 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 08/21] mm/hugetlb: Initialize page table lock for vmemmap Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 10/21] mm/hugetlb: Defer freeing of hugetlb pages Muchun Song
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

When we allocate a hugetlb page from the buddy allocator, we should free
the unused vmemmap pages associated with it. We can do that in
prep_new_huge_page().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/include/asm/hugetlb.h          |   9 ++
 arch/x86/include/asm/pgtable_64_types.h |   8 ++
 mm/hugetlb.c                            |  16 +++
 mm/hugetlb_vmemmap.c                    | 188 ++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h                    |   5 +
 5 files changed, 226 insertions(+)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 1721b1aadeb1..c601fe042832 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -4,6 +4,15 @@
 
 #include <asm/page.h>
 #include <asm-generic/hugetlb.h>
+#include <asm/pgtable.h>
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+#define vmemmap_pmd_huge vmemmap_pmd_huge
+static inline bool vmemmap_pmd_huge(pmd_t *pmd)
+{
+	return pmd_large(*pmd);
+}
+#endif
 
 #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
 
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..bedbd2e7d06c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -139,6 +139,14 @@ extern unsigned int ptrs_per_p4d;
 # define VMEMMAP_START		__VMEMMAP_BASE_L4
 #endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
 
+/*
+ * VMEMMAP_SIZE - allows the whole linear region to be covered by
+ *                a struct page array.
+ */
+#define VMEMMAP_SIZE		(1UL << (__VIRTUAL_MASK_SHIFT - PAGE_SHIFT - \
+					 1 + ilog2(sizeof(struct page))))
+#define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
+
 #define VMALLOC_END		(VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
 
 #define MODULES_VADDR		(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f88032c24667..a0ce6f33a717 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1499,6 +1499,14 @@ void free_huge_page(struct page *page)
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
+	free_huge_page_vmemmap(h, page);
+	/*
+	 * Because we store preallocated pages on @page->lru,
+	 * vmemmap_pgtable_free() must be called before the
+	 * initialization of @page->lru in INIT_LIST_HEAD().
+	 */
+	vmemmap_pgtable_free(page);
+
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	set_hugetlb_cgroup(page, NULL);
@@ -1751,6 +1759,14 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 	if (!page)
 		return NULL;
 
+	if (vmemmap_pgtable_prealloc(h, page)) {
+		if (hstate_is_gigantic(h))
+			free_gigantic_page(page, huge_page_order(h));
+		else
+			put_page(page);
+		return NULL;
+	}
+
 	if (hstate_is_gigantic(h))
 		prep_compound_gigantic_page(page, huge_page_order(h));
 	prep_new_huge_page(h, page, page_to_nid(page));
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index bc8546df4a51..6f8a735e0dd3 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -102,6 +102,7 @@
 #include <linux/pagewalk.h>
 #include <linux/mmzone.h>
 #include <linux/list.h>
+#include <linux/bootmem_info.h>
 #include <asm/pgalloc.h>
 #include "hugetlb_vmemmap.h"
 
@@ -114,6 +115,8 @@
  * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
  */
 #define RESERVE_VMEMMAP_NR		2U
+#define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
+#define TAIL_PAGE_REUSE			-1
 
 #ifndef VMEMMAP_HPAGE_SHIFT
 #define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
@@ -123,6 +126,21 @@
 #define VMEMMAP_HPAGE_SIZE		((1UL) << VMEMMAP_HPAGE_SHIFT)
 #define VMEMMAP_HPAGE_MASK		(~(VMEMMAP_HPAGE_SIZE - 1))
 
+#define vmemmap_hpage_addr_end(addr, end)				 \
+({									 \
+	unsigned long __boundary;					 \
+	__boundary = ((addr) + VMEMMAP_HPAGE_SIZE) & VMEMMAP_HPAGE_MASK; \
+	(__boundary - 1 < (end) - 1) ? __boundary : (end);		 \
+})
+
+#ifndef vmemmap_pmd_huge
+#define vmemmap_pmd_huge vmemmap_pmd_huge
+static inline bool vmemmap_pmd_huge(pmd_t *pmd)
+{
+	return pmd_huge(*pmd);
+}
+#endif
+
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return h->nr_free_vmemmap_pages;
@@ -189,6 +207,176 @@ int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
 	return -ENOMEM;
 }
 
+/*
+ * Walk a vmemmap address to the pmd it maps.
+ */
+static pmd_t *vmemmap_to_pmd(unsigned long page)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (page < VMEMMAP_START || page >= VMEMMAP_END)
+		return NULL;
+
+	pgd = pgd_offset_k(page);
+	if (pgd_none(*pgd))
+		return NULL;
+	p4d = p4d_offset(pgd, page);
+	if (p4d_none(*p4d))
+		return NULL;
+	pud = pud_offset(p4d, page);
+
+	if (pud_none(*pud) || pud_bad(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, page);
+
+	return pmd;
+}
+
+static inline spinlock_t *vmemmap_pmd_lock(pmd_t *pmd)
+{
+	return pmd_lock(&init_mm, pmd);
+}
+
+static inline int freed_vmemmap_hpage(struct page *page)
+{
+	return atomic_read(&page->_mapcount) + 1;
+}
+
+static inline int freed_vmemmap_hpage_inc(struct page *page)
+{
+	return atomic_inc_return_relaxed(&page->_mapcount) + 1;
+}
+
+static inline int freed_vmemmap_hpage_dec(struct page *page)
+{
+	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
+}
+
+static inline void free_vmemmap_page_list(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		list_del(&page->lru);
+		free_vmemmap_page(page);
+	}
+}
+
+static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
+					 unsigned long start,
+					 unsigned long end,
+					 struct list_head *free_pages)
+{
+	/* Map the tail pages read-only. */
+	pgprot_t pgprot = PAGE_KERNEL_RO;
+	pte_t entry = mk_pte(reuse, pgprot);
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
+		struct page *page;
+		pte_t old = *ptep;
+
+		VM_WARN_ON(!pte_present(old));
+		page = pte_page(old);
+		list_add(&page->lru, free_pages);
+
+		set_pte_at(&init_mm, addr, ptep, entry);
+	}
+}
+
+static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					 unsigned long addr,
+					 struct list_head *free_pages)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	addr = start;
+	do {
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse)
+			reuse = pte_page(ptep[TAIL_PAGE_REUSE]);
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		__free_huge_page_pte_vmemmap(reuse, ptep, addr, next,
+					     free_pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
+static void split_vmemmap_pmd(pmd_t *pmd, pte_t *pte_p, unsigned long addr)
+{
+	int i;
+	pgprot_t pgprot = PAGE_KERNEL;
+	struct mm_struct *mm = &init_mm;
+	struct page *page;
+	pmd_t old_pmd, _pmd;
+
+	old_pmd = READ_ONCE(*pmd);
+	page = pmd_page(old_pmd);
+	pmd_populate_kernel(mm, &_pmd, pte_p);
+
+	for (i = 0; i < VMEMMAP_HPAGE_NR; i++, addr += PAGE_SIZE) {
+		pte_t entry, *pte;
+
+		entry = mk_pte(page + i, pgprot);
+		pte = pte_offset_kernel(&_pmd, addr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, addr, pte, entry);
+	}
+
+	/* make pte visible before pmd */
+	smp_wmb();
+	pmd_populate_kernel(mm, pmd, pte_p);
+}
+
+static void split_vmemmap_huge_page(struct page *head, pmd_t *pmd)
+{
+	struct page *pte_page, *t_page;
+	unsigned long start = (unsigned long)head & VMEMMAP_HPAGE_MASK;
+	unsigned long addr = start;
+
+	list_for_each_entry_safe(pte_page, t_page, &head->lru, lru) {
+		list_del(&pte_page->lru);
+		VM_BUG_ON(freed_vmemmap_hpage(pte_page));
+		split_vmemmap_pmd(pmd++, page_to_virt(pte_page), addr);
+		addr += VMEMMAP_HPAGE_SIZE;
+	}
+
+	flush_tlb_kernel_range(start, addr);
+}
+
+void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	LIST_HEAD(free_pages);
+
+	if (!free_vmemmap_pages_per_hpage(h))
+		return;
+
+	pmd = vmemmap_to_pmd((unsigned long)head);
+	BUG_ON(!pmd);
+
+	ptl = vmemmap_pmd_lock(pmd);
+	if (vmemmap_pmd_huge(pmd))
+		split_vmemmap_huge_page(head, pmd);
+
+	__free_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages);
+	freed_vmemmap_hpage_inc(pmd_page(*pmd));
+	spin_unlock(ptl);
+
+	free_vmemmap_page_list(&free_pages);
+}
+
 void __init hugetlb_vmemmap_init(struct hstate *h)
 {
 	unsigned int order = huge_page_order(h);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 9eca6879c0a4..a9425d94ed8b 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -14,6 +14,7 @@
 void __init hugetlb_vmemmap_init(struct hstate *h);
 int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
 void vmemmap_pgtable_free(struct page *page);
+void free_huge_page_vmemmap(struct hstate *h, struct page *head);
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
@@ -27,5 +28,9 @@ static inline int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
 static inline void vmemmap_pgtable_free(struct page *page)
 {
 }
+
+static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 10/21] mm/hugetlb: Defer freeing of hugetlb pages
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (8 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

In a subsequent patch, we will allocate the vmemmap pages when freeing
huge pages. But update_and_free_page() may be called from a non-task
context (while holding hugetlb_lock), so defer the actual freeing to a
workqueue to avoid using GFP_ATOMIC to allocate the vmemmap pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         | 98 +++++++++++++++++++++++++++++++++++++++++++++-------
 mm/hugetlb_vmemmap.c |  5 ---
 mm/hugetlb_vmemmap.h | 10 ++++++
 3 files changed, 96 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a0ce6f33a717..4aabf12aca9b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1221,7 +1221,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned int order)
+static void __free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
 	 * If the page isn't allocated using the cma allocator,
@@ -1288,20 +1288,100 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
 	return NULL;
 }
-static inline void free_gigantic_page(struct page *page, unsigned int order) { }
+static inline void __free_gigantic_page(struct page *page,
+					unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
-static void update_and_free_page(struct hstate *h, struct page *page)
+static void __free_hugepage(struct hstate *h, struct page *page);
+
+/*
+ * As update_and_free_page() may be called from a non-task context (while
+ * holding hugetlb_lock), defer the actual freeing to a workqueue so we do
+ * not have to use GFP_ATOMIC to allocate a lot of vmemmap pages.
+ *
+ * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of
+ * pages to be freed and frees them one-by-one. As the page->mapping pointer
+ * is going to be cleared in update_hpage_vmemmap_workfn() anyway, it is
+ * reused as the llist_node structure of a lockless linked list of huge
+ * pages to be freed.
+ */
+static LLIST_HEAD(hpage_update_freelist);
+
+static void update_hpage_vmemmap_workfn(struct work_struct *work)
 {
-	int i;
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_update_freelist);
+
+	while (node) {
+		page = container_of((struct address_space **)node,
+				     struct page, mapping);
+		node = node->next;
+		page->mapping = NULL;
+		__free_hugepage(page_hstate(page), page);
 
+		cond_resched();
+	}
+}
+static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
+
+static inline void __update_and_free_page(struct hstate *h, struct page *page)
+{
+	/* No need to allocate vmemmap pages */
+	if (!free_vmemmap_pages_per_hpage(h)) {
+		__free_hugepage(h, page);
+		return;
+	}
+
+	/*
+	 * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap
+	 * pages.
+	 *
+	 * Only call schedule_work() if hpage_update_freelist is previously
+	 * empty. Otherwise, schedule_work() had been called but the workfn
+	 * hasn't retrieved the list yet.
+	 */
+	if (llist_add((struct llist_node *)&page->mapping,
+		      &hpage_update_freelist))
+		schedule_work(&hpage_update_work);
+}
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+static inline void free_gigantic_page(struct hstate *h, struct page *page)
+{
+	__free_gigantic_page(page, huge_page_order(h));
+}
+#else
+static inline void free_gigantic_page(struct hstate *h, struct page *page)
+{
+	/*
+	 * Temporarily drop the hugetlb_lock, because
+	 * we might block in __free_gigantic_page().
+	 */
+	spin_unlock(&hugetlb_lock);
+	__free_gigantic_page(page, huge_page_order(h));
+	spin_lock(&hugetlb_lock);
+}
+#endif
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
+
+	__update_and_free_page(h, page);
+}
+
+static void __free_hugepage(struct hstate *h, struct page *page)
+{
+	int i;
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
@@ -1313,14 +1393,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
+		free_gigantic_page(h, page);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
@@ -1761,7 +1835,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 
 	if (vmemmap_pgtable_prealloc(h, page)) {
 		if (hstate_is_gigantic(h))
-			free_gigantic_page(page, huge_page_order(h));
+			free_gigantic_page(h, page);
 		else
 			put_page(page);
 		return NULL;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 6f8a735e0dd3..eda7e3a0b67c 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -141,11 +141,6 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
 }
 #endif
 
-static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
-{
-	return h->nr_free_vmemmap_pages;
-}
-
 static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index a9425d94ed8b..4175b44f88bc 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -15,6 +15,11 @@ void __init hugetlb_vmemmap_init(struct hstate *h);
 int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
 void vmemmap_pgtable_free(struct page *page);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return h->nr_free_vmemmap_pages;
+}
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
@@ -32,5 +37,10 @@ static inline void vmemmap_pgtable_free(struct page *page)
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (9 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 10/21] mm/hugetlb: Defer freeing of hugetlb pages Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:11   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper Muchun Song
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

When we free a hugetlb page back to the buddy, we should allocate the
vmemmap pages associated with it. We can do that in __free_hugepage().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         |   2 ++
 mm/hugetlb_vmemmap.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h |   5 +++
 3 files changed, 107 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4aabf12aca9b..ba927ae7f9bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1382,6 +1382,8 @@ static void __free_hugepage(struct hstate *h, struct page *page)
 {
 	int i;
 
+	alloc_huge_page_vmemmap(h, page);
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index eda7e3a0b67c..361c4174e222 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -117,6 +117,8 @@
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 #define TAIL_PAGE_REUSE			-1
+#define GFP_VMEMMAP_PAGE		\
+	(GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
 
 #ifndef VMEMMAP_HPAGE_SHIFT
 #define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
@@ -250,6 +252,104 @@ static inline int freed_vmemmap_hpage_dec(struct page *page)
 	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
 }
 
+static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
+					  unsigned long start,
+					  unsigned long end,
+					  struct list_head *remap_pages)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	void *from = page_to_virt(reuse);
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		void *to;
+		struct page *page;
+		pte_t entry, old = *ptep;
+
+		page = list_first_entry_or_null(remap_pages, struct page, lru);
+		list_del(&page->lru);
+		to = page_to_virt(page);
+		copy_page(to, from);
+
+		/*
+		 * Make sure that any data written to @to is made
+		 * visible to the physical page.
+		 */
+		flush_kernel_vmap_range(to, PAGE_SIZE);
+
+		prepare_vmemmap_page(page);
+
+		entry = mk_pte(page, pgprot);
+		set_pte_at(&init_mm, addr, ptep++, entry);
+
+		VM_BUG_ON(!pte_present(old) || pte_page(old) != reuse);
+	}
+}
+
+static void __remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					  unsigned long addr,
+					  struct list_head *remap_pages)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	addr = start;
+	do {
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse)
+			reuse = pte_page(ptep[TAIL_PAGE_REUSE]);
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		__remap_huge_page_pte_vmemmap(reuse, ptep, addr, next,
+					      remap_pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
+static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < free_vmemmap_pages_per_hpage(h); i++) {
+		struct page *page;
+
+		/* This should not fail */
+		page = alloc_page(GFP_VMEMMAP_PAGE);
+		list_add_tail(&page->lru, list);
+	}
+}
+
+void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	LIST_HEAD(remap_pages);
+
+	if (!free_vmemmap_pages_per_hpage(h))
+		return;
+
+	alloc_vmemmap_pages(h, &remap_pages);
+
+	pmd = vmemmap_to_pmd((unsigned long)head);
+	BUG_ON(!pmd);
+
+	ptl = vmemmap_pmd_lock(pmd);
+	__remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head,
+				      &remap_pages);
+	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
+		/*
+		 * Todo:
+		 * Merge pte to huge pmd if it has ever been split.
+		 */
+	}
+	spin_unlock(ptl);
+}
+
 static inline void free_vmemmap_page_list(struct list_head *list)
 {
 	struct page *page, *next;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 4175b44f88bc..6dfa7ed6f88a 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -14,6 +14,7 @@
 void __init hugetlb_vmemmap_init(struct hstate *h);
 int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
 void vmemmap_pgtable_free(struct page *page);
+void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
 
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
@@ -34,6 +35,10 @@ static inline void vmemmap_pgtable_free(struct page *page)
 {
 }
 
+static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
+
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (10 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

__free_huge_page_pmd_vmemmap() and __remap_huge_page_pmd_vmemmap() are
almost identical. So introduce a remap_huge_page_pmd_vmemmap() helper
to simplify the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 108 +++++++++++++++++++++------------------------------
 1 file changed, 45 insertions(+), 63 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 361c4174e222..06e2b8a7b7c8 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -252,6 +252,47 @@ static inline int freed_vmemmap_hpage_dec(struct page *page)
 	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
 }
 
+static inline void free_vmemmap_page_list(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		list_del(&page->lru);
+		free_vmemmap_page(page);
+	}
+}
+
+typedef void (*remap_pte_fn)(struct page *reuse, pte_t *ptep,
+			     unsigned long start, unsigned long end,
+			     struct list_head *pages);
+
+static void remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					unsigned long addr,
+					struct list_head *pages,
+					remap_pte_fn remap_fn)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	flush_cache_vunmap(start, end);
+
+	addr = start;
+	do {
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse)
+			reuse = pte_page(ptep[TAIL_PAGE_REUSE]);
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		remap_fn(reuse, ptep, addr, next, pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
 static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 					  unsigned long start,
 					  unsigned long end,
@@ -286,31 +327,6 @@ static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 	}
 }
 
-static void __remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
-					  unsigned long addr,
-					  struct list_head *remap_pages)
-{
-	unsigned long next;
-	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
-	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
-	struct page *reuse = NULL;
-
-	addr = start;
-	do {
-		pte_t *ptep;
-
-		ptep = pte_offset_kernel(pmd, addr);
-		if (!reuse)
-			reuse = pte_page(ptep[TAIL_PAGE_REUSE]);
-
-		next = vmemmap_hpage_addr_end(addr, end);
-		__remap_huge_page_pte_vmemmap(reuse, ptep, addr, next,
-					      remap_pages);
-	} while (pmd++, addr = next, addr != end);
-
-	flush_tlb_kernel_range(start, end);
-}
-
 static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
 {
 	int i;
@@ -339,8 +355,8 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 	BUG_ON(!pmd);
 
 	ptl = vmemmap_pmd_lock(pmd);
-	__remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head,
-				      &remap_pages);
+	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &remap_pages,
+				    __remap_huge_page_pte_vmemmap);
 	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
 		/*
 		 * Todo:
@@ -350,16 +366,6 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 	spin_unlock(ptl);
 }
 
-static inline void free_vmemmap_page_list(struct list_head *list)
-{
-	struct page *page, *next;
-
-	list_for_each_entry_safe(page, next, list, lru) {
-		list_del(&page->lru);
-		free_vmemmap_page(page);
-	}
-}
-
 static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 					 unsigned long start,
 					 unsigned long end,
@@ -382,31 +388,6 @@ static void __free_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 	}
 }
 
-static void __free_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
-					 unsigned long addr,
-					 struct list_head *free_pages)
-{
-	unsigned long next;
-	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
-	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
-	struct page *reuse = NULL;
-
-	addr = start;
-	do {
-		pte_t *ptep;
-
-		ptep = pte_offset_kernel(pmd, addr);
-		if (!reuse)
-			reuse = pte_page(ptep[TAIL_PAGE_REUSE]);
-
-		next = vmemmap_hpage_addr_end(addr, end);
-		__free_huge_page_pte_vmemmap(reuse, ptep, addr, next,
-					     free_pages);
-	} while (pmd++, addr = next, addr != end);
-
-	flush_tlb_kernel_range(start, end);
-}
-
 static void split_vmemmap_pmd(pmd_t *pmd, pte_t *pte_p, unsigned long addr)
 {
 	int i;
@@ -465,7 +446,8 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 	if (vmemmap_pmd_huge(pmd))
 		split_vmemmap_huge_page(head, pmd);
 
-	__free_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages);
+	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages,
+				    __free_huge_page_pte_vmemmap);
 	freed_vmemmap_hpage_inc(pmd_page(*pmd));
 	spin_unlock(ptl);
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (11 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:16   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page Muchun Song
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

When we allocate a hugetlb page from the buddy, we may need to split a
huge pmd into ptes. When we free the hugetlb page, we can merge the
ptes back into a pmd. So we need to remember whether the pmd has been
split. Page tables are not allocated from slab, so we can reuse
PG_slab to indicate that the pmd has been split.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 06e2b8a7b7c8..e2ddc73ce25f 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -293,6 +293,25 @@ static void remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
 	flush_tlb_kernel_range(start, end);
 }
 
+static inline bool pmd_split(pmd_t *pmd)
+{
+	return PageSlab(pmd_page(*pmd));
+}
+
+static inline void set_pmd_split(pmd_t *pmd)
+{
+	/*
+	 * Page tables are never allocated from slab, so we can reuse
+	 * PG_slab to indicate that the pmd has been split.
+	 */
+	__SetPageSlab(pmd_page(*pmd));
+}
+
+static inline void clear_pmd_split(pmd_t *pmd)
+{
+	__ClearPageSlab(pmd_page(*pmd));
+}
+
 static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 					  unsigned long start,
 					  unsigned long end,
@@ -357,11 +376,12 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 	ptl = vmemmap_pmd_lock(pmd);
 	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &remap_pages,
 				    __remap_huge_page_pte_vmemmap);
-	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
+	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
 		/*
 		 * Todo:
 		 * Merge pte to huge pmd if it has ever been split.
 		 */
+		clear_pmd_split(pmd);
 	}
 	spin_unlock(ptl);
 }
@@ -443,8 +463,10 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 	BUG_ON(!pmd);
 
 	ptl = vmemmap_pmd_lock(pmd);
-	if (vmemmap_pmd_huge(pmd))
+	if (vmemmap_pmd_huge(pmd)) {
 		split_vmemmap_huge_page(head, pmd);
+		set_pmd_split(pmd);
+	}
 
 	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages,
 				    __free_huge_page_pte_vmemmap);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (12 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Gigantic pages are allocated from bootmem. If we want to free their
unused vmemmap pages, we also need to allocate the page tables, so
allocate those page tables from bootmem as well.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            |  5 +++++
 mm/hugetlb_vmemmap.c    | 60 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h    | 13 +++++++++++
 4 files changed, 81 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index eed3dd3bd626..da18fc9ed152 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -506,6 +506,9 @@ struct hstate {
 struct huge_bootmem_page {
 	struct list_head list;
 	struct hstate *hstate;
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	pte_t *vmemmap_pte;
+#endif
 };
 
 struct page *alloc_huge_page(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba927ae7f9bd..055604d07046 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2607,6 +2607,7 @@ static void __init gather_bootmem_prealloc(void)
 		WARN_ON(page_count(page) != 1);
 		prep_compound_huge_page(page, h->order);
 		WARN_ON(PageReserved(page));
+		gather_vmemmap_pgtable_init(m, page);
 		prep_new_huge_page(h, page, page_to_nid(page));
 		put_page(page); /* free it into the hugepage allocator */
 
@@ -2659,6 +2660,10 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			break;
 		cond_resched();
 	}
+
+	if (hstate_is_gigantic(h))
+		i -= gather_vmemmap_pgtable_prealloc();
+
 	if (i < h->max_huge_pages) {
 		char buf[32];
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e2ddc73ce25f..3629165d8158 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -103,6 +103,7 @@
 #include <linux/mmzone.h>
 #include <linux/list.h>
 #include <linux/bootmem_info.h>
+#include <linux/memblock.h>
 #include <asm/pgalloc.h>
 #include "hugetlb_vmemmap.h"
 
@@ -204,6 +205,65 @@ int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
 	return -ENOMEM;
 }
 
+unsigned long __init gather_vmemmap_pgtable_prealloc(void)
+{
+	struct huge_bootmem_page *m, *tmp;
+	unsigned long nr_free = 0;
+
+	list_for_each_entry_safe(m, tmp, &huge_boot_pages, list) {
+		struct hstate *h = m->hstate;
+		unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+		unsigned int pgtable_size;
+
+		if (!nr)
+			continue;
+
+		pgtable_size = nr << PAGE_SHIFT;
+		m->vmemmap_pte = memblock_alloc_try_nid(pgtable_size,
+				PAGE_SIZE, 0, MEMBLOCK_ALLOC_ACCESSIBLE,
+				NUMA_NO_NODE);
+		if (!m->vmemmap_pte) {
+			nr_free++;
+			list_del(&m->list);
+			memblock_free_early(__pa(m), huge_page_size(h));
+		}
+	}
+
+	return nr_free;
+}
+
+void __init gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+					struct page *page)
+{
+	struct hstate *h = m->hstate;
+	unsigned long pte = (unsigned long)m->vmemmap_pte;
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	/*
+	 * Use the huge page lru list to temporarily store the preallocated
+	 * pages. The preallocated pages are used and the list is emptied
+	 * before the huge page is put into use. When the huge page is put
+	 * into use by prep_new_huge_page() the list will be reinitialized.
+	 */
+	INIT_LIST_HEAD(&page->lru);
+
+	while (nr--) {
+		struct page *pte_page = virt_to_page(pte);
+
+		__ClearPageReserved(pte_page);
+		list_add(&pte_page->lru, &page->lru);
+		pte += PAGE_SIZE;
+	}
+
+	/*
+	 * If we had gigantic hugepages allocated at boot time, we need
+	 * to restore the 'stolen' pages to totalram_pages in order to
+	 * fix confusing memory reports from free(1) and another
+	 * side-effects, like CommitLimit going negative.
+	 */
+	adjust_managed_page_count(page, nr);
+}
+
 /*
  * Walk a vmemmap address to the pmd it maps.
  */
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 6dfa7ed6f88a..779d3cb9333f 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -14,6 +14,9 @@
 void __init hugetlb_vmemmap_init(struct hstate *h);
 int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
 void vmemmap_pgtable_free(struct page *page);
+unsigned long __init gather_vmemmap_pgtable_prealloc(void);
+void __init gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+					struct page *page);
 void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
 
@@ -35,6 +38,16 @@ static inline void vmemmap_pgtable_free(struct page *page)
 {
 }
 
+static inline unsigned long gather_vmemmap_pgtable_prealloc(void)
+{
+	return 0;
+}
+
+static inline void gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+					       struct page *page)
+{
+}
+
 static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (13 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:19   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 16/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Because we reuse the first tail page, setting PageHWPoison on a
tail page would effectively set it on a series of tail pages that
share the same mapping. So use head[4].private to record the index
of the real error page, and set PageHWPoison on the raw error page
later.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         | 11 +++--------
 mm/hugetlb_vmemmap.h | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 055604d07046..b853aacd5c16 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1383,6 +1383,7 @@ static void __free_hugepage(struct hstate *h, struct page *page)
 	int i;
 
 	alloc_huge_page_vmemmap(h, page);
+	subpage_hwpoison_deliver(page);
 
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1944,14 +1945,8 @@ int dissolve_free_huge_page(struct page *page)
 		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
-		/*
-		 * Move PageHWPoison flag from head page to the raw error page,
-		 * which makes any subpages rather than the error page reusable.
-		 */
-		if (PageHWPoison(head) && page != head) {
-			SetPageHWPoison(page);
-			ClearPageHWPoison(head);
-		}
+
+		set_subpage_hwpoison(head, page);
 		list_del(&head->lru);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 779d3cb9333f..65e94436ffff 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -20,6 +20,29 @@ void __init gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
 void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
 
+static inline void subpage_hwpoison_deliver(struct page *head)
+{
+	struct page *page = head;
+
+	if (PageHWPoison(head))
+		page = head + page_private(head + 4);
+
+	/*
+	 * Move PageHWPoison flag from head page to the raw error page,
+	 * which makes any subpages rather than the error page reusable.
+	 */
+	if (page != head) {
+		SetPageHWPoison(page);
+		ClearPageHWPoison(head);
+	}
+}
+
+static inline void set_subpage_hwpoison(struct page *head, struct page *page)
+{
+	if (PageHWPoison(head))
+		set_page_private(head + 4, page - head);
+}
+
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return h->nr_free_vmemmap_pages;
@@ -56,6 +79,22 @@ static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
 
+static inline void subpage_hwpoison_deliver(struct page *head)
+{
+}
+
+static inline void set_subpage_hwpoison(struct page *head, struct page *page)
+{
+	/*
+	 * Move PageHWPoison flag from head page to the raw error page,
+	 * which makes any subpages rather than the error page reusable.
+	 */
+	if (PageHWPoison(head) && page != head) {
+		SetPageHWPoison(page);
+		ClearPageHWPoison(head);
+	}
+}
+
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return 0;
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 16/21] mm/hugetlb: Flush work when dissolving hugetlb page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (14 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:20   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

We should flush the work when dissolving a hugetlb page to make sure
that the hugetlb page is actually freed to the buddy allocator before
returning.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b853aacd5c16..9aad0b63d369 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1328,6 +1328,12 @@ static void update_hpage_vmemmap_workfn(struct work_struct *work)
 }
 static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
 
+static inline void flush_hpage_update_work(struct hstate *h)
+{
+	if (free_vmemmap_pages_per_hpage(h))
+		flush_work(&hpage_update_work);
+}
+
 static inline void __update_and_free_page(struct hstate *h, struct page *page)
 {
 	/* No need to allocate vmemmap pages */
@@ -1928,6 +1934,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 int dissolve_free_huge_page(struct page *page)
 {
 	int rc = -EBUSY;
+	struct hstate *h = NULL;
 
 	/* Not to disrupt normal path by vainly holding hugetlb_lock */
 	if (!PageHuge(page))
@@ -1941,8 +1948,9 @@ int dissolve_free_huge_page(struct page *page)
 
 	if (!page_count(page)) {
 		struct page *head = compound_head(page);
-		struct hstate *h = page_hstate(head);
 		int nid = page_to_nid(head);
+
+		h = page_hstate(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
 
@@ -1956,6 +1964,14 @@ int dissolve_free_huge_page(struct page *page)
 	}
 out:
 	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * We should flush work before return to make sure that
+	 * the HugeTLB page is freed to the buddy.
+	 */
+	if (!rc && h)
+		flush_hpage_update_work(h);
+
 	return rc;
 }
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (15 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 16/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:22   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Add a kernel parameter, hugetlb_free_vmemmap, that allows the feature
of freeing unused vmemmap pages associated with each HugeTLB page to
be disabled at boot time.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  9 +++++++++
 Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +++
 mm/hugetlb_vmemmap.c                            | 21 +++++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5debfe238027..ccf07293cb63 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1551,6 +1551,15 @@
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]
 
+	hugetlb_free_vmemmap=
+			[KNL] When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set,
+			this controls freeing unused vmemmap pages associated
+			with each HugeTLB page.
+			Format: { on (default) | off }
+
+			on:  enable the feature
+			off: disable the feature
+
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
 			Format: 0 | 1
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f7b1c7462991..7d6129ee97dd 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,6 +145,9 @@ default_hugepagesz
 
 	will all result in 256 2M huge pages being allocated.  Valid default
 	huge page size is architecture dependent.
+hugetlb_free_vmemmap
+	When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this disables freeing
+	unused vmemmap pages associated with each HugeTLB page.
 
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 3629165d8158..c958699d1393 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -144,6 +144,22 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
 }
 #endif
 
+static bool hugetlb_free_vmemmap_disabled __initdata;
+
+static int __init early_hugetlb_free_vmemmap_param(char *buf)
+{
+	if (!buf)
+		return -EINVAL;
+
+	if (!strcmp(buf, "off"))
+		hugetlb_free_vmemmap_disabled = true;
+	else if (strcmp(buf, "on"))
+		return -EINVAL;
+
+	return 0;
+}
+early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
+
 static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
@@ -541,6 +557,11 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 	unsigned int order = huge_page_order(h);
 	unsigned int vmemmap_pages;
 
+	if (hugetlb_free_vmemmap_disabled) {
+		pr_info("disable free vmemmap pages for %s\n", h->name);
+		return;
+	}
+
 	vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
 	/*
 	 * The head page and the first tail page are not to be freed to buddy
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (16 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:23   ` Michal Hocko
  2020-11-20  6:43 ` [PATCH v5 19/21] mm/hugetlb: Gather discrete indexes of tail page Muchun Song
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Merge the ptes back to a huge pmd if the pmd has ever been split.
For now, only support gigantic pages whose vmemmap size is an
integer multiple of PMD_SIZE. This is the simplest case to handle.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/include/asm/hugetlb.h |   8 +++
 mm/hugetlb_vmemmap.c           | 118 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index c601fe042832..1de1c519a84a 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -12,6 +12,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
 {
 	return pmd_large(*pmd);
 }
+
+#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
+static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
+{
+	pte_t entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL_LARGE);
+
+	return __pmd(pte_val(entry));
+}
 #endif
 
 #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c958699d1393..bf2b6b3e75af 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -144,6 +144,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
 }
 #endif
 
+#ifndef vmemmap_pmd_mkhuge
+#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
+static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
+{
+	return pmd_mkhuge(mk_pmd(page, PAGE_KERNEL));
+}
+#endif
+
 static bool hugetlb_free_vmemmap_disabled __initdata;
 
 static int __init early_hugetlb_free_vmemmap_param(char *buf)
@@ -422,6 +430,104 @@ static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
 	}
 }
 
+static void __replace_huge_page_pte_vmemmap(pte_t *ptep, unsigned long start,
+					    unsigned int nr, struct page *huge,
+					    struct list_head *free_pages)
+{
+	unsigned long addr;
+	unsigned long end = start + (nr << PAGE_SHIFT);
+	pgprot_t pgprot = PAGE_KERNEL;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
+		struct page *page;
+		pte_t old = *ptep;
+		pte_t entry;
+
+		prepare_vmemmap_page(huge);
+
+		entry = mk_pte(huge++, pgprot);
+		VM_WARN_ON(!pte_present(old));
+		page = pte_page(old);
+		list_add(&page->lru, free_pages);
+
+		set_pte_at(&init_mm, addr, ptep, entry);
+	}
+}
+
+static void replace_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
+					  struct page *huge,
+					  struct list_head *free_pages)
+{
+	unsigned long end = start + VMEMMAP_HPAGE_SIZE;
+
+	flush_cache_vunmap(start, end);
+	__replace_huge_page_pte_vmemmap(pte_offset_kernel(pmd, start), start,
+					VMEMMAP_HPAGE_NR, huge, free_pages);
+	flush_tlb_kernel_range(start, end);
+}
+
+static pte_t *merge_vmemmap_pte(pmd_t *pmdp, unsigned long addr)
+{
+	pte_t *pte;
+	struct page *page;
+
+	pte = pte_offset_kernel(pmdp, addr);
+	page = pte_page(*pte);
+	set_pmd(pmdp, vmemmap_pmd_mkhuge(page));
+
+	return pte;
+}
+
+static void merge_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
+					struct page *huge,
+					struct list_head *free_pages)
+{
+	replace_huge_page_pmd_vmemmap(pmd, start, huge, free_pages);
+	pte_free_kernel(&init_mm, merge_vmemmap_pte(pmd, start));
+	flush_tlb_kernel_range(start, start + VMEMMAP_HPAGE_SIZE);
+}
+
+static inline void dissolve_compound_page(struct page *page, unsigned int order)
+{
+	int i;
+	unsigned int nr_pages = 1 << order;
+
+	for (i = 1; i < nr_pages; i++)
+		set_page_count(page + i, 1);
+}
+
+static void merge_gigantic_page_vmemmap(struct hstate *h, struct page *head,
+					pmd_t *pmd)
+{
+	LIST_HEAD(free_pages);
+	unsigned long addr = (unsigned long)head;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+
+	for (; addr < end; addr += VMEMMAP_HPAGE_SIZE) {
+		void *to;
+		struct page *page;
+
+		page = alloc_pages(GFP_VMEMMAP_PAGE & ~__GFP_NOFAIL,
+				   VMEMMAP_HPAGE_ORDER);
+		if (!page)
+			goto out;
+
+		dissolve_compound_page(page, VMEMMAP_HPAGE_ORDER);
+		to = page_to_virt(page);
+		memcpy(to, (void *)addr, VMEMMAP_HPAGE_SIZE);
+
+		/*
+		 * Make sure that any data that writes to the
+		 * @to is made visible to the physical page.
+		 */
+		flush_kernel_vmap_range(to, VMEMMAP_HPAGE_SIZE);
+
+		merge_huge_page_pmd_vmemmap(pmd++, addr, page, &free_pages);
+	}
+out:
+	free_vmemmap_page_list(&free_pages);
+}
+
 static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
 {
 	int i;
@@ -454,10 +560,18 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 				    __remap_huge_page_pte_vmemmap);
 	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
 		/*
-		 * Todo:
-		 * Merge pte to huge pmd if it has ever been split.
+		 * Merge pte to huge pmd if it has ever been split. Now only
+		 * support gigantic page which's vmemmap pages size is an
+		 * integer multiple of PMD_SIZE. This is the simplest case
+		 * to handle.
 		 */
 		clear_pmd_split(pmd);
+
+		if (IS_ALIGNED(vmemmap_pages_per_hpage(h), VMEMMAP_HPAGE_NR)) {
+			spin_unlock(ptl);
+			merge_gigantic_page_vmemmap(h, head, pmd);
+			return;
+		}
 	}
 	spin_unlock(ptl);
 }
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 19/21] mm/hugetlb: Gather discrete indexes of tail page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (17 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 20/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page Muchun Song
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

For a HugeTLB page, there is more metadata to save in the struct
pages. The head struct page alone cannot meet our needs, so we have
to abuse fields of the tail struct pages to store the metadata. To
avoid conflicts when more tail struct pages are used later, gather
these discrete tail-page indexes into one place. This makes it
easier to add a new tail page index later.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h        | 13 +++++++++++++
 include/linux/hugetlb_cgroup.h | 15 +++++++++------
 mm/hugetlb.c                   | 12 ++++++------
 mm/hugetlb_vmemmap.h           |  4 ++--
 4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index da18fc9ed152..fa9d38a3ac6f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -28,6 +28,19 @@ typedef struct { unsigned long pd; } hugepd_t;
 #include <linux/shm.h>
 #include <asm/tlbflush.h>
 
+enum {
+	SUBPAGE_INDEX_ACTIVE = 1,	/* reuse page flags of PG_private */
+	SUBPAGE_INDEX_TEMPORARY,	/* reuse page->mapping */
+#ifdef CONFIG_CGROUP_HUGETLB
+	SUBPAGE_INDEX_CGROUP = SUBPAGE_INDEX_TEMPORARY,/* reuse page->private */
+	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
+#endif
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	SUBPAGE_INDEX_HWPOISON,		/* reuse page->private */
+#endif
+	NR_USED_SUBPAGE,
+};
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 2ad6e92f124a..3d3c1c49efe4 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -24,8 +24,9 @@ struct file_region;
 /*
  * Minimum page order trackable by hugetlb cgroup.
  * At least 4 pages are necessary for all the tracking information.
- * The second tail page (hpage[2]) is the fault usage cgroup.
- * The third tail page (hpage[3]) is the reservation usage cgroup.
+ * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault
+ * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD])
+ * is the reservation usage cgroup.
  */
 #define HUGETLB_CGROUP_MIN_ORDER	2
 
@@ -66,9 +67,9 @@ __hugetlb_cgroup_from_page(struct page *page, bool rsvd)
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
 	if (rsvd)
-		return (struct hugetlb_cgroup *)page[3].private;
+		return (void *)page_private(page + SUBPAGE_INDEX_CGROUP_RSVD);
 	else
-		return (struct hugetlb_cgroup *)page[2].private;
+		return (void *)page_private(page + SUBPAGE_INDEX_CGROUP);
 }
 
 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
@@ -90,9 +91,11 @@ static inline int __set_hugetlb_cgroup(struct page *page,
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return -1;
 	if (rsvd)
-		page[3].private = (unsigned long)h_cg;
+		set_page_private(page + SUBPAGE_INDEX_CGROUP_RSVD,
+				 (unsigned long)h_cg);
 	else
-		page[2].private = (unsigned long)h_cg;
+		set_page_private(page + SUBPAGE_INDEX_CGROUP,
+				 (unsigned long)h_cg);
 	return 0;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9aad0b63d369..dfa982f4b525 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1429,20 +1429,20 @@ struct hstate *size_to_hstate(unsigned long size)
 bool page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHuge(page), page);
-	return PageHead(page) && PagePrivate(&page[1]);
+	return PageHead(page) && PagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
 }
 
 /* never called for tail page */
 static void set_page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	SetPagePrivate(&page[1]);
+	SetPagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
 }
 
 static void clear_page_huge_active(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	ClearPagePrivate(&page[1]);
+	ClearPagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
 }
 
 /*
@@ -1454,17 +1454,17 @@ static inline bool PageHugeTemporary(struct page *page)
 	if (!PageHuge(page))
 		return false;
 
-	return (unsigned long)page[2].mapping == -1U;
+	return (unsigned long)page[SUBPAGE_INDEX_TEMPORARY].mapping == -1U;
 }
 
 static inline void SetPageHugeTemporary(struct page *page)
 {
-	page[2].mapping = (void *)-1U;
+	page[SUBPAGE_INDEX_TEMPORARY].mapping = (void *)-1U;
 }
 
 static inline void ClearPageHugeTemporary(struct page *page)
 {
-	page[2].mapping = NULL;
+	page[SUBPAGE_INDEX_TEMPORARY].mapping = NULL;
 }
 
 static void __free_huge_page(struct page *page)
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 65e94436ffff..d9c1f45e93ae 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -25,7 +25,7 @@ static inline void subpage_hwpoison_deliver(struct page *head)
 	struct page *page = head;
 
 	if (PageHWPoison(head))
-		page = head + page_private(head + 4);
+		page = head + page_private(head + SUBPAGE_INDEX_HWPOISON);
 
 	/*
 	 * Move PageHWPoison flag from head page to the raw error page,
@@ -40,7 +40,7 @@ static inline void subpage_hwpoison_deliver(struct page *head)
 static inline void set_subpage_hwpoison(struct page *head, struct page *page)
 {
 	if (PageHWPoison(head))
-		set_page_private(head + 4, page - head);
+		set_page_private(head + SUBPAGE_INDEX_HWPOISON, page - head);
 }
 
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 20/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (18 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 19/21] mm/hugetlb: Gather discrete indexes of tail page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  6:43 ` [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two Muchun Song
  2020-11-20  8:42 ` [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Michal Hocko
  21 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

Only `RESERVE_VMEMMAP_SIZE / sizeof(struct page)` tail struct pages
can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled, so add
a BUILD_BUG_ON to catch invalid usage of the tail struct pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index bf2b6b3e75af..c3b3fc041903 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -750,6 +750,9 @@ static int __init vmemmap_ptlock_init(void)
 {
 	int nid;
 
+	BUILD_BUG_ON(NR_USED_SUBPAGE >=
+		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
+
 	if (!hugepages_supported())
 		return 0;
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (19 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 20/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page Muchun Song
@ 2020-11-20  6:43 ` Muchun Song
  2020-11-20  8:25   ` Michal Hocko
  2020-11-20  9:16   ` David Hildenbrand
  2020-11-20  8:42 ` [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Michal Hocko
  21 siblings, 2 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  6:43 UTC (permalink / raw)
  To: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
	Muchun Song

We can only free the unused vmemmap pages to the buddy system when
the size of struct page is a power of two.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c3b3fc041903..7bb749a3eea2 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -671,7 +671,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 	unsigned int order = huge_page_order(h);
 	unsigned int vmemmap_pages;
 
-	if (hugetlb_free_vmemmap_disabled) {
+	if (hugetlb_free_vmemmap_disabled ||
+	    !is_power_of_2(sizeof(struct page))) {
 		pr_info("disable free vmemmap pages for %s\n", h->name);
 		return;
 	}
-- 
2.11.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-20  6:43 ` [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
@ 2020-11-20  7:49   ` Michal Hocko
  2020-11-20  8:35     ` [External] " Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  7:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:07, Muchun Song wrote:
> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> whether to enable the feature of freeing unused vmemmap associated
> with HugeTLB pages. Now only support x86.

Why is the config option necessary? Are code savings with the feature
disabled really worth it? I can see that your later patch adds a kernel
command line option. I believe that is a more reasonable way to control
the feature. I would argue that this should be an opt-in rather than
opt-out though. Think of users of pre-built (e.g. distribution kernels)
who might be interested in the feature. Yet you cannot assume that such
a kernel would enable the feature with its overhead to all hugetlb
users.

That being said, unless there are huge advantages to introduce a
config option I would rather not add it because our config space is huge
already, and the more we add, the more future code maintenance it will
require. If you want the config just for dependency checks then fine by me.
 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  arch/x86/mm/init_64.c |  2 +-
>  fs/Kconfig            | 14 ++++++++++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0a45f062826e..0435bee2e172 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>  
>  static void __init register_page_bootmem_info(void)
>  {
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>  	int i;
>  
>  	for_each_online_node(i)
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 976e8b9033c4..4961dd488444 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -245,6 +245,20 @@ config HUGETLBFS
>  config HUGETLB_PAGE
>  	def_bool HUGETLBFS
>  
> +config HUGETLB_PAGE_FREE_VMEMMAP
> +	def_bool HUGETLB_PAGE
> +	depends on X86
> +	depends on SPARSEMEM_VMEMMAP
> +	depends on HAVE_BOOTMEM_INFO_NODE
> +	help
> +	  When using HUGETLB_PAGE_FREE_VMEMMAP, the system can save up some
> +	  memory from pre-allocated HugeTLB pages when they are not used.
> +	  6 pages per 2MB HugeTLB page and 4094 per 1GB HugeTLB page.
> +
> +	  When the pages are going to be used or freed up, the vmemmap array
> +	  representing that range needs to be remapped again and the pages
> +	  we discarded earlier need to be reallocated again.
> +
>  config MEMFD_CREATE
>  	def_bool TMPFS || HUGETLBFS
>  
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20  6:43 ` [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
@ 2020-11-20  8:11   ` Michal Hocko
  2020-11-20  8:51     ` [External] " Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:11 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:15, Muchun Song wrote:
[...]
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index eda7e3a0b67c..361c4174e222 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -117,6 +117,8 @@
>  #define RESERVE_VMEMMAP_NR		2U
>  #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
>  #define TAIL_PAGE_REUSE			-1
> +#define GFP_VMEMMAP_PAGE		\
> +	(GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)

This is really dangerous! __GFP_MEMALLOC would allow a complete memory
depletion. I am not even sure triggering the OOM killer is a reasonable
behavior. It is just unexpected that shrinking a hugetlb pool can have
destructive side effects. I believe it would be more reasonable to
simply refuse to shrink the pool if we cannot free those pages up. This
sucks as well but it isn't destructive at least.
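
The non-destructive fallback described here — fail the shrink rather than
dip into emergency reserves — boils down to an all-or-nothing allocation
with rollback. A Python model of that pattern (illustrative only, not
kernel code; the free list stands in for the buddy allocator):

```python
free_list = list(range(3))            # model: only 3 pages available

def alloc_page():
    return free_list.pop() if free_list else None

def free_page(p):
    free_list.append(p)

def alloc_vmemmap_pages(n):
    allocated = []
    for _ in range(n):
        page = alloc_page()
        if page is None:              # allocation failed:
            for p in allocated:       # roll back what we already took
                free_page(p)
            return None               # caller refuses to shrink the pool
        allocated.append(page)
    return allocated

assert alloc_vmemmap_pages(5) is None      # not enough memory: no change
assert len(free_list) == 3                 # rollback restored the pool
assert alloc_vmemmap_pages(2) is not None  # smaller request succeeds
```
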
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd
  2020-11-20  6:43 ` [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
@ 2020-11-20  8:16   ` Michal Hocko
  2020-11-20  9:30     ` [External] " Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:16 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:17, Muchun Song wrote:
> When we allocate hugetlb page from buddy, we may need split huge pmd
> to pte. When we free the hugetlb page, we can merge pte to pmd. So
> we need to distinguish whether the previous pmd has been split. The
> page table is not allocated from slab. So we can reuse the PG_slab
> to indicate that the pmd has been split.

PageSlab is used outside of the slab allocator proper and that code
might get confused by this AFAICS.

From the above description it is not really clear why this is needed
though. Who is supposed to use this? Say you are allocating a fresh
hugetlb page. Once you have it, nobody else can be interfering. It is
exclusive to the caller. The later machinery can check the vmemmap page
tables to find out whether a split is needed or not. Or do I miss
something?

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb_vmemmap.c | 26 ++++++++++++++++++++++++--
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 06e2b8a7b7c8..e2ddc73ce25f 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -293,6 +293,25 @@ static void remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
>  	flush_tlb_kernel_range(start, end);
>  }
>  
> +static inline bool pmd_split(pmd_t *pmd)
> +{
> +	return PageSlab(pmd_page(*pmd));
> +}
> +
> +static inline void set_pmd_split(pmd_t *pmd)
> +{
> +	/*
> +	 * We should not use slab for page table allocation. So we can set
> +	 * PG_slab to indicate that the pmd has been split.
> +	 */
> +	__SetPageSlab(pmd_page(*pmd));
> +}
> +
> +static inline void clear_pmd_split(pmd_t *pmd)
> +{
> +	__ClearPageSlab(pmd_page(*pmd));
> +}
> +
>  static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
>  					  unsigned long start,
>  					  unsigned long end,
> @@ -357,11 +376,12 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
>  	ptl = vmemmap_pmd_lock(pmd);
>  	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &remap_pages,
>  				    __remap_huge_page_pte_vmemmap);
> -	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
> +	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
>  		/*
>  		 * Todo:
>  		 * Merge pte to huge pmd if it has ever been split.
>  		 */
> +		clear_pmd_split(pmd);
>  	}
>  	spin_unlock(ptl);
>  }
> @@ -443,8 +463,10 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
>  	BUG_ON(!pmd);
>  
>  	ptl = vmemmap_pmd_lock(pmd);
> -	if (vmemmap_pmd_huge(pmd))
> +	if (vmemmap_pmd_huge(pmd)) {
>  		split_vmemmap_huge_page(head, pmd);
> +		set_pmd_split(pmd);
> +	}
>  
>  	remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages,
>  				    __free_huge_page_pte_vmemmap);
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page
  2020-11-20  6:43 ` [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
@ 2020-11-20  8:19   ` Michal Hocko
  2020-11-20 10:32     ` [External] " Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:19 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:19, Muchun Song wrote:
> Because we reuse the first tail page, setting PageHWPoison on a
> tail page may effectively mark a whole series of reused pages as
> poisoned. So we use head[4].private to record the real error page
> index and set PageHWPoison on the raw error page later.

This really begs more explanation. Maybe I misremember but If there
is a HWPoison hole in a hugepage then the whole page is demolished, no?
If that is the case then why do we care about tail pages?
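
For reference, the scheme in the commit message can be modeled in a few
lines of Python (illustrative only; the Page class is a stand-in for
struct page, with `private` playing the role of page_private(head + 4)):

```python
class Page:
    def __init__(self):
        self.hwpoison = False   # PageHWPoison flag
        self.private = 0        # page_private()

def set_subpage_hwpoison(pages, error_idx):
    # while vmemmap is freed, only the head can carry the flag;
    # remember where the real error page is in head[4].private
    if pages[0].hwpoison:
        pages[4].private = error_idx

def subpage_hwpoison_deliver(pages):
    # once vmemmap is restored, move the flag to the raw error page
    if pages[0].hwpoison:
        idx = pages[4].private
        if idx != 0:            # idx == 0 means the head itself
            pages[idx].hwpoison = True
            pages[0].hwpoison = False

pages = [Page() for _ in range(512)]     # one 2 MB huge page
pages[0].hwpoison = True                 # poison reported via the head
set_subpage_hwpoison(pages, 7)           # raw error page is subpage 7
subpage_hwpoison_deliver(pages)
assert pages[7].hwpoison and not pages[0].hwpoison
```
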
 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c         | 11 +++--------
>  mm/hugetlb_vmemmap.h | 39 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 055604d07046..b853aacd5c16 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1383,6 +1383,7 @@ static void __free_hugepage(struct hstate *h, struct page *page)
>  	int i;
>  
>  	alloc_huge_page_vmemmap(h, page);
> +	subpage_hwpoison_deliver(page);
>  
>  	for (i = 0; i < pages_per_huge_page(h); i++) {
>  		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
> @@ -1944,14 +1945,8 @@ int dissolve_free_huge_page(struct page *page)
>  		int nid = page_to_nid(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
> -		/*
> -		 * Move PageHWPoison flag from head page to the raw error page,
> -		 * which makes any subpages rather than the error page reusable.
> -		 */
> -		if (PageHWPoison(head) && page != head) {
> -			SetPageHWPoison(page);
> -			ClearPageHWPoison(head);
> -		}
> +
> +		set_subpage_hwpoison(head, page);
>  		list_del(&head->lru);
>  		h->free_huge_pages--;
>  		h->free_huge_pages_node[nid]--;
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index 779d3cb9333f..65e94436ffff 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -20,6 +20,29 @@ void __init gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
>  void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
>  void free_huge_page_vmemmap(struct hstate *h, struct page *head);
>  
> +static inline void subpage_hwpoison_deliver(struct page *head)
> +{
> +	struct page *page = head;
> +
> +	if (PageHWPoison(head))
> +		page = head + page_private(head + 4);
> +
> +	/*
> +	 * Move PageHWPoison flag from head page to the raw error page,
> +	 * which makes any subpages rather than the error page reusable.
> +	 */
> +	if (page != head) {
> +		SetPageHWPoison(page);
> +		ClearPageHWPoison(head);
> +	}
> +}
> +
> +static inline void set_subpage_hwpoison(struct page *head, struct page *page)
> +{
> +	if (PageHWPoison(head))
> +		set_page_private(head + 4, page - head);
> +}
> +
>  static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
>  {
>  	return h->nr_free_vmemmap_pages;
> @@ -56,6 +79,22 @@ static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
>  {
>  }
>  
> +static inline void subpage_hwpoison_deliver(struct page *head)
> +{
> +}
> +
> +static inline void set_subpage_hwpoison(struct page *head, struct page *page)
> +{
> +	/*
> +	 * Move PageHWPoison flag from head page to the raw error page,
> +	 * which makes any subpages rather than the error page reusable.
> +	 */
> +	if (PageHWPoison(head) && page != head) {
> +		SetPageHWPoison(page);
> +		ClearPageHWPoison(head);
> +	}
> +}
> +
>  static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
>  {
>  	return 0;
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 16/21] mm/hugetlb: Flush work when dissolving hugetlb page
  2020-11-20  6:43 ` [PATCH v5 16/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
@ 2020-11-20  8:20   ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:20 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:20, Muchun Song wrote:
> We should flush work when dissolving a hugetlb page to make sure that
> the hugetlb page is freed to the buddy.

Why? This explanation on its own doesn't really help to understand what
the point of the patch is.
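
For context, freeing the restored page back to the buddy is deferred to a
work item, so a caller that must observe the page actually freed has to
flush the pending work first. A Python sketch of that ordering, with a
thread and queue standing in for the kernel workqueue (illustrative only):

```python
import threading
import queue

work_q = queue.Queue()
freed = []                       # pages returned to the "buddy"

def update_hpage_vmemmap_workfn():
    while True:
        page = work_q.get()
        freed.append(page)       # the actual free happens here, async
        work_q.task_done()

threading.Thread(target=update_hpage_vmemmap_workfn, daemon=True).start()

# dissolve path: queue the page, then flush before claiming success
work_q.put("hugepage-0")
work_q.join()                    # flush_work(): wait for the free
assert freed == ["hugepage-0"]
```
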

> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b853aacd5c16..9aad0b63d369 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1328,6 +1328,12 @@ static void update_hpage_vmemmap_workfn(struct work_struct *work)
>  }
>  static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
>  
> +static inline void flush_hpage_update_work(struct hstate *h)
> +{
> +	if (free_vmemmap_pages_per_hpage(h))
> +		flush_work(&hpage_update_work);
> +}
> +
>  static inline void __update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	/* No need to allocate vmemmap pages */
> @@ -1928,6 +1934,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  int dissolve_free_huge_page(struct page *page)
>  {
>  	int rc = -EBUSY;
> +	struct hstate *h = NULL;
>  
>  	/* Not to disrupt normal path by vainly holding hugetlb_lock */
>  	if (!PageHuge(page))
> @@ -1941,8 +1948,9 @@ int dissolve_free_huge_page(struct page *page)
>  
>  	if (!page_count(page)) {
>  		struct page *head = compound_head(page);
> -		struct hstate *h = page_hstate(head);
>  		int nid = page_to_nid(head);
> +
> +		h = page_hstate(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
>  
> @@ -1956,6 +1964,14 @@ int dissolve_free_huge_page(struct page *page)
>  	}
>  out:
>  	spin_unlock(&hugetlb_lock);
> +
> +	/*
> +	 * We should flush work before return to make sure that
> +	 * the HugeTLB page is freed to the buddy.
> +	 */
> +	if (!rc && h)
> +		flush_hpage_update_work(h);
> +
>  	return rc;
>  }
>  
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  2020-11-20  6:43 ` [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
@ 2020-11-20  8:22   ` Michal Hocko
  2020-11-20 10:39     ` [External] " Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:22 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:21, Muchun Song wrote:
> Add a kernel parameter hugetlb_free_vmemmap to disable the feature of
> freeing unused vmemmap pages associated with each hugetlb page on boot.

As replied to the config patch. This is fine but I would argue that the
default should be flipped. Saving memory is nice but it comes with
overhead and therefore should be an opt-in. The config option should
only guard compile time dependencies not a user choice.
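
For reference, the on/off parsing the patch adds (quoted below) behaves
like this Python model (illustrative only; -EINVAL is modeled as -22):

```python
EINVAL = 22

def early_hugetlb_free_vmemmap_param(buf):
    # returns (hugetlb_free_vmemmap_disabled, errno)
    if buf is None:
        return None, -EINVAL
    if buf == "off":
        return True, 0           # disable the feature
    if buf == "on":              # "on" is already the default
        return False, 0
    return None, -EINVAL         # anything else is rejected

assert early_hugetlb_free_vmemmap_param("off") == (True, 0)
assert early_hugetlb_free_vmemmap_param("on") == (False, 0)
assert early_hugetlb_free_vmemmap_param("bogus")[1] == -EINVAL
```
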

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  9 +++++++++
>  Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +++
>  mm/hugetlb_vmemmap.c                            | 21 +++++++++++++++++++++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 5debfe238027..ccf07293cb63 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1551,6 +1551,15 @@
>  			Documentation/admin-guide/mm/hugetlbpage.rst.
>  			Format: size[KMG]
>  
> +	hugetlb_free_vmemmap=
> +			[KNL] When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set,
> +			this controls freeing unused vmemmap pages associated
> +			with each HugeTLB page.
> +			Format: { on (default) | off }
> +
> +			on:  enable the feature
> +			off: disable the feature
> +
>  	hung_task_panic=
>  			[KNL] Should the hung task detector generate panics.
>  			Format: 0 | 1
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index f7b1c7462991..7d6129ee97dd 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -145,6 +145,9 @@ default_hugepagesz
>  
>  	will all result in 256 2M huge pages being allocated.  Valid default
>  	huge page size is architecture dependent.
> +hugetlb_free_vmemmap
> +	When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this disables freeing
> +	unused vmemmap pages associated with each HugeTLB page.
>  
>  When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
>  indicates the current number of pre-allocated huge pages of the default size.
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 3629165d8158..c958699d1393 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -144,6 +144,22 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
>  }
>  #endif
>  
> +static bool hugetlb_free_vmemmap_disabled __initdata;
> +
> +static int __init early_hugetlb_free_vmemmap_param(char *buf)
> +{
> +	if (!buf)
> +		return -EINVAL;
> +
> +	if (!strcmp(buf, "off"))
> +		hugetlb_free_vmemmap_disabled = true;
> +	else if (strcmp(buf, "on"))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
> +
>  static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
>  {
>  	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
> @@ -541,6 +557,11 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>  	unsigned int order = huge_page_order(h);
>  	unsigned int vmemmap_pages;
>  
> +	if (hugetlb_free_vmemmap_disabled) {
> +		pr_info("disable free vmemmap pages for %s\n", h->name);
> +		return;
> +	}
> +
>  	vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
>  	/*
>  	 * The head page and the first tail page are not to be freed to buddy
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page
  2020-11-20  6:43 ` [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
@ 2020-11-20  8:23   ` Michal Hocko
  2020-11-20 10:41     ` [External] " Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:23 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:22, Muchun Song wrote:
> Merge pte to huge pmd if it has ever been split. For now this only
> supports gigantic pages whose vmemmap size is an integer multiple
> of PMD_SIZE. This is the simplest case to handle.

I think it would be benefitial for anybody who plan to implement this
for normal PMDs to document challenges while you still have them fresh
in your mind.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  arch/x86/include/asm/hugetlb.h |   8 +++
>  mm/hugetlb_vmemmap.c           | 118 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 124 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> index c601fe042832..1de1c519a84a 100644
> --- a/arch/x86/include/asm/hugetlb.h
> +++ b/arch/x86/include/asm/hugetlb.h
> @@ -12,6 +12,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
>  {
>  	return pmd_large(*pmd);
>  }
> +
> +#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
> +static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
> +{
> +	pte_t entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL_LARGE);
> +
> +	return __pmd(pte_val(entry));
> +}
>  #endif
>  
>  #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c958699d1393..bf2b6b3e75af 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -144,6 +144,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
>  }
>  #endif
>  
> +#ifndef vmemmap_pmd_mkhuge
> +#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
> +static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
> +{
> +	return pmd_mkhuge(mk_pmd(page, PAGE_KERNEL));
> +}
> +#endif
> +
>  static bool hugetlb_free_vmemmap_disabled __initdata;
>  
>  static int __init early_hugetlb_free_vmemmap_param(char *buf)
> @@ -422,6 +430,104 @@ static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
>  	}
>  }
>  
> +static void __replace_huge_page_pte_vmemmap(pte_t *ptep, unsigned long start,
> +					    unsigned int nr, struct page *huge,
> +					    struct list_head *free_pages)
> +{
> +	unsigned long addr;
> +	unsigned long end = start + (nr << PAGE_SHIFT);
> +	pgprot_t pgprot = PAGE_KERNEL;
> +
> +	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
> +		struct page *page;
> +		pte_t old = *ptep;
> +		pte_t entry;
> +
> +		prepare_vmemmap_page(huge);
> +
> +		entry = mk_pte(huge++, pgprot);
> +		VM_WARN_ON(!pte_present(old));
> +		page = pte_page(old);
> +		list_add(&page->lru, free_pages);
> +
> +		set_pte_at(&init_mm, addr, ptep, entry);
> +	}
> +}
> +
> +static void replace_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
> +					  struct page *huge,
> +					  struct list_head *free_pages)
> +{
> +	unsigned long end = start + VMEMMAP_HPAGE_SIZE;
> +
> +	flush_cache_vunmap(start, end);
> +	__replace_huge_page_pte_vmemmap(pte_offset_kernel(pmd, start), start,
> +					VMEMMAP_HPAGE_NR, huge, free_pages);
> +	flush_tlb_kernel_range(start, end);
> +}
> +
> +static pte_t *merge_vmemmap_pte(pmd_t *pmdp, unsigned long addr)
> +{
> +	pte_t *pte;
> +	struct page *page;
> +
> +	pte = pte_offset_kernel(pmdp, addr);
> +	page = pte_page(*pte);
> +	set_pmd(pmdp, vmemmap_pmd_mkhuge(page));
> +
> +	return pte;
> +}
> +
> +static void merge_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
> +					struct page *huge,
> +					struct list_head *free_pages)
> +{
> +	replace_huge_page_pmd_vmemmap(pmd, start, huge, free_pages);
> +	pte_free_kernel(&init_mm, merge_vmemmap_pte(pmd, start));
> +	flush_tlb_kernel_range(start, start + VMEMMAP_HPAGE_SIZE);
> +}
> +
> +static inline void dissolve_compound_page(struct page *page, unsigned int order)
> +{
> +	int i;
> +	unsigned int nr_pages = 1 << order;
> +
> +	for (i = 1; i < nr_pages; i++)
> +		set_page_count(page + i, 1);
> +}
> +
> +static void merge_gigantic_page_vmemmap(struct hstate *h, struct page *head,
> +					pmd_t *pmd)
> +{
> +	LIST_HEAD(free_pages);
> +	unsigned long addr = (unsigned long)head;
> +	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
> +
> +	for (; addr < end; addr += VMEMMAP_HPAGE_SIZE) {
> +		void *to;
> +		struct page *page;
> +
> +		page = alloc_pages(GFP_VMEMMAP_PAGE & ~__GFP_NOFAIL,
> +				   VMEMMAP_HPAGE_ORDER);
> +		if (!page)
> +			goto out;
> +
> +		dissolve_compound_page(page, VMEMMAP_HPAGE_ORDER);
> +		to = page_to_virt(page);
> +		memcpy(to, (void *)addr, VMEMMAP_HPAGE_SIZE);
> +
> +		/*
> +		 * Make sure that any data that writes to the
> +		 * @to is made visible to the physical page.
> +		 */
> +		flush_kernel_vmap_range(to, VMEMMAP_HPAGE_SIZE);
> +
> +		merge_huge_page_pmd_vmemmap(pmd++, addr, page, &free_pages);
> +	}
> +out:
> +	free_vmemmap_page_list(&free_pages);
> +}
> +
>  static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
>  {
>  	int i;
> @@ -454,10 +560,18 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
>  				    __remap_huge_page_pte_vmemmap);
>  	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
>  		/*
> -		 * Todo:
> -		 * Merge pte to huge pmd if it has ever been split.
> +		 * Merge pte to huge pmd if it has ever been split. For now
> +		 * this only supports gigantic pages whose vmemmap size is
> +		 * an integer multiple of PMD_SIZE. This is the simplest
> +		 * case to handle.
>  		 */
>  		clear_pmd_split(pmd);
> +
> +		if (IS_ALIGNED(vmemmap_pages_per_hpage(h), VMEMMAP_HPAGE_NR)) {
> +			spin_unlock(ptl);
> +			merge_gigantic_page_vmemmap(h, head, pmd);
> +			return;
> +		}
>  	}
>  	spin_unlock(ptl);
>  }
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  6:43 ` [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two Muchun Song
@ 2020-11-20  8:25   ` Michal Hocko
  2020-11-20  9:15     ` David Hildenbrand
  2020-11-22 19:00     ` Matthew Wilcox
  2020-11-20  9:16   ` David Hildenbrand
  1 sibling, 2 replies; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:25 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:25, Muchun Song wrote:
> We can only free the unused vmemmap pages to the buddy system when
> the size of struct page is a power of two.

Can we actually have !power_of_2 struct pages?

> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb_vmemmap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c3b3fc041903..7bb749a3eea2 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -671,7 +671,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>  	unsigned int order = huge_page_order(h);
>  	unsigned int vmemmap_pages;
>  
> -	if (hugetlb_free_vmemmap_disabled) {
> +	if (hugetlb_free_vmemmap_disabled ||
> +	    !is_power_of_2(sizeof(struct page))) {
>  		pr_info("disable free vmemmap pages for %s\n", h->name);
>  		return;
>  	}
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-20  7:49   ` Michal Hocko
@ 2020-11-20  8:35     ` Muchun Song
  2020-11-20  8:47       ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  8:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 3:49 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:07, Muchun Song wrote:
> > The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> > whether to enable the feature of freeing unused vmemmap associated
> > with HugeTLB pages. Now only support x86.
>
> Why is the config option necessary? Are code savings with the feature
> disabled really worth it? I can see that your later patch adds a kernel
> command line option. I believe that is a more reasonable way to control
> the feature. I would argue that this should be an opt-in rather than
> opt-out though. Think of users of pre-built (e.g. distribution kernels)
> who might be interested in the feature. Yet you cannot assume that such
> a kernel would enable the feature with its overhead to all hugetlb
> users.

For now the config option is necessary, because the feature only
supports x86 and other architectures need extra code to support it.
Once we implement it on the other architectures, we can remove this
option.

Also, this config option is not user-selectable. It defaults to the
value of CONFIG_HUGETLB_PAGE, so if the kernel selects
CONFIG_HUGETLB_PAGE, CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is selected
as well. The user can only disable the feature via the boot command
line :).

Thanks.

>
> That being said, unless there are huge advantages to introduce a
> config option I would rather not add it because our config space is huge
> already and the more we add the more future code maintainance that will
> add. If you want the config just for dependency checks then fine by me.

Yeah, it is only for dependency checks :)

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  arch/x86/mm/init_64.c |  2 +-
> >  fs/Kconfig            | 14 ++++++++++++++
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 0a45f062826e..0435bee2e172 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> >
> >  static void __init register_page_bootmem_info(void)
> >  {
> > -#ifdef CONFIG_NUMA
> > +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> >       int i;
> >
> >       for_each_online_node(i)
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 976e8b9033c4..4961dd488444 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -245,6 +245,20 @@ config HUGETLBFS
> >  config HUGETLB_PAGE
> >       def_bool HUGETLBFS
> >
> > +config HUGETLB_PAGE_FREE_VMEMMAP
> > +     def_bool HUGETLB_PAGE
> > +     depends on X86
> > +     depends on SPARSEMEM_VMEMMAP
> > +     depends on HAVE_BOOTMEM_INFO_NODE
> > +     help
> > +       When using HUGETLB_PAGE_FREE_VMEMMAP, the system can save up some
> > +       memory from pre-allocated HugeTLB pages when they are not used.
> > +       6 pages per 2MB HugeTLB page and 4094 per 1GB HugeTLB page.
> > +
> > +       When the pages are going to be used or freed up, the vmemmap array
> > +       representing that range needs to be remapped again and the pages
> > +       we discarded earlier need to be rellocated again.
> > +
> >  config MEMFD_CREATE
> >       def_bool TMPFS || HUGETLBFS
> >
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs



--
Yours,
Muchun


* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
                   ` (20 preceding siblings ...)
  2020-11-20  6:43 ` [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two Muchun Song
@ 2020-11-20  8:42 ` Michal Hocko
  2020-11-20  9:27   ` David Hildenbrand
  2020-11-20 12:40   ` [External] " Muchun Song
  21 siblings, 2 replies; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 14:43:04, Muchun Song wrote:
[...]

Thanks for improving the cover letter and providing some numbers. I have
only glanced through the patchset because I didn't really have more time
to dive deeply into it.

Overall it looks promising. To summarize: I would prefer to not have
the feature enablement controlled by compile time option and the kernel
command line option should be opt-in. I also do not like that freeing
the pool can trigger the oom killer or even shut the system down if no
oom victim is eligible.

One thing that I didn't really get to think hard about is what is the
effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
invalid when racing with the split. How do we enforce that this won't
blow up?

I have also asked in a previous version whether the vmemmap manipulation
should really be unconditional, e.g. for short-lived hugetlb pages
allocated from the buddy allocator directly rather than for a pool. Maybe
it should be restricted to pool allocations, as those are considered
long term, so the overhead would be amortized and the freeing path
restrictions easier to understand.

>  Documentation/admin-guide/kernel-parameters.txt |   9 +
>  Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
>  arch/x86/include/asm/hugetlb.h                  |  17 +
>  arch/x86/include/asm/pgtable_64_types.h         |   8 +
>  arch/x86/mm/init_64.c                           |   7 +-
>  fs/Kconfig                                      |  14 +
>  include/linux/bootmem_info.h                    |  78 +++
>  include/linux/hugetlb.h                         |  19 +
>  include/linux/hugetlb_cgroup.h                  |  15 +-
>  include/linux/memory_hotplug.h                  |  27 -
>  mm/Makefile                                     |   2 +
>  mm/bootmem_info.c                               | 124 ++++
>  mm/hugetlb.c                                    | 163 ++++-
>  mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
>  mm/hugetlb_vmemmap.h                            | 103 ++++

I will need to look closer but I suspect that a non-trivial part of the
vmemmap manipulation really belongs to mm/sparse-vmemmap.c because the
split and remapping shouldn't really be hugetlb specific. Sure hugetlb
knows how to split but all the splitting should be implemented in
vmemmap proper.

>  mm/memory_hotplug.c                             | 116 ----
>  mm/sparse.c                                     |   5 +-
>  17 files changed, 1295 insertions(+), 180 deletions(-)
>  create mode 100644 include/linux/bootmem_info.h
>  create mode 100644 mm/bootmem_info.c
>  create mode 100644 mm/hugetlb_vmemmap.c
>  create mode 100644 mm/hugetlb_vmemmap.h

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-20  8:35     ` [External] " Muchun Song
@ 2020-11-20  8:47       ` Michal Hocko
  2020-11-20  8:53         ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  8:47 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri 20-11-20 16:35:16, Muchun Song wrote:
[...]
> > > That being said, unless there are huge advantages to introducing a
> > > config option I would rather not add it, because our config space is huge
> > > already and the more we add, the more future code maintenance it will
> > > require. If you want the config just for dependency checks then fine by me.
> 
> Yeah, it is only for dependency checks :)

OK, I must have misread the definition to think that it requires the user
to enable it explicitly.

Anyway this feature cannot really be on by default due to the overhead. So
the command line option default has to be flipped.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20  8:11   ` Michal Hocko
@ 2020-11-20  8:51     ` Muchun Song
  2020-11-20  9:28       ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  8:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:15, Muchun Song wrote:
> [...]
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index eda7e3a0b67c..361c4174e222 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -117,6 +117,8 @@
> >  #define RESERVE_VMEMMAP_NR           2U
> >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> >  #define TAIL_PAGE_REUSE                      -1
> > +#define GFP_VMEMMAP_PAGE             \
> > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
>
> This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> depletion. I am not even sure triggering the OOM killer is a reasonable
> behavior. It is just unexpected that shrinking a hugetlb pool can have
> destructive side effects. I believe it would be more reasonable to
> simply refuse to shrink the pool if we cannot free those pages up. This
> sucks as well but it isn't destructive at least.

I found the description of __GFP_MEMALLOC in the kernel doc.

%__GFP_MEMALLOC allows access to all memory. This should only be used when
the caller guarantees the allocation will allow more memory to be freed
very shortly.

Our situation is in line with the description above: we will free a HugeTLB
page to the buddy allocator, which is much larger than what we allocate
shortly beforehand.

Thanks.

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2020-11-20  8:47       ` Michal Hocko
@ 2020-11-20  8:53         ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20  8:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:47 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 16:35:16, Muchun Song wrote:
> [...]
> > > That being said, unless there are huge advantages to introducing a
> > > config option I would rather not add it, because our config space is huge
> > > already and the more we add, the more future code maintenance it will
> > > require. If you want the config just for dependency checks then fine by me.
> >
> > Yeah, it is only for dependency checks :)
>
> OK, I must have misread the definition to think that it requires the user
> to enable it explicitly.
>
> Anyway this feature cannot really be on by default due to the overhead. So
> the command line option default has to be flipped.

Got it. Thanks for your suggestion.

>
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  8:25   ` Michal Hocko
@ 2020-11-20  9:15     ` David Hildenbrand
  2020-11-22 13:30       ` Mike Rapoport
  2020-11-22 19:00     ` Matthew Wilcox
  1 sibling, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2020-11-20  9:15 UTC (permalink / raw)
  To: Michal Hocko, Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 20.11.20 09:25, Michal Hocko wrote:
> On Fri 20-11-20 14:43:25, Muchun Song wrote:
>> We can only free the unused vmemmap pages to the buddy system when the
>> size of struct page is a power of two.
> 
> Can we actually have !power_of_2 struct pages?

AFAIK multiples of 8 bytes (56, 64, 72) are possible.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  6:43 ` [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two Muchun Song
  2020-11-20  8:25   ` Michal Hocko
@ 2020-11-20  9:16   ` David Hildenbrand
  2020-11-20 10:42     ` [External] " Muchun Song
  1 sibling, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2020-11-20  9:16 UTC (permalink / raw)
  To: Muchun Song, corbet, mike.kravetz, tglx, mingo, bp, x86, hpa,
	dave.hansen, luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, mhocko, song.bao.hua
  Cc: duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 20.11.20 07:43, Muchun Song wrote:
> We can only free the unused vmemmap pages to the buddy system when the
> size of struct page is a power of two.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>   mm/hugetlb_vmemmap.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c3b3fc041903..7bb749a3eea2 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -671,7 +671,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>   	unsigned int order = huge_page_order(h);
>   	unsigned int vmemmap_pages;
>   
> -	if (hugetlb_free_vmemmap_disabled) {
> +	if (hugetlb_free_vmemmap_disabled ||
> +	    !is_power_of_2(sizeof(struct page))) {
>   		pr_info("disable free vmemmap pages for %s\n", h->name);
>   		return;
>   	}
> 

This patch should be merged into the original patch that introduced 
vmemmap freeing.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20  8:42 ` [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Michal Hocko
@ 2020-11-20  9:27   ` David Hildenbrand
  2020-11-20  9:39     ` Michal Hocko
  2020-11-20 12:40   ` [External] " Muchun Song
  1 sibling, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2020-11-20  9:27 UTC (permalink / raw)
  To: Michal Hocko, Muchun Song
  Cc: corbet, mike.kravetz, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 20.11.20 09:42, Michal Hocko wrote:
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
> 
> Thanks for improving the cover letter and providing some numbers. I have
> only glanced through the patchset because I didn't really have more time
> to dive deeply into it.
> 
> Overall it looks promising. To summarize: I would prefer to not have
> the feature enablement controlled by compile time option and the kernel
> command line option should be opt-in. I also do not like that freeing
> the pool can trigger the oom killer or even shut the system down if no
> oom victim is eligible.
> 
> One thing that I didn't really get to think hard about is what is the
> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> invalid when racing with the split. How do we enforce that this won't
> blow up?

I have the same concerns - the sections are online the whole time and 
anybody with pfn_to_online_page() can grab them.

I think we have similar issues with memory offlining when removing the 
vmemmap, it's just very hard to trigger and we can easily protect by 
grabbing the memhotplug lock. I once discussed with Dan using rcu to 
protect the SECTION_IS_ONLINE bit, to make sure anybody who did a 
pfn_to_online_page() stopped using the page. Of course, such an approach 
is not easy to use in this context where the sections stay online the 
whole time ... we would have to protect vmemmap table entries using rcu 
or similar, which can get quite ugly.

To keep things easy, maybe simply never allow freeing these hugetlb
pages again for now? If they were reserved during boot and the vmemmap
condensed, then just let them stick around for all eternity.

Once we have a safe approach on how to modify an online vmemmap, we can 
enable this freeing, and eventually also dynamically manage vmemmaps for 
runtime-allocated huge pages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20  8:51     ` [External] " Muchun Song
@ 2020-11-20  9:28       ` Michal Hocko
  2020-11-20  9:37         ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  9:28 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri 20-11-20 16:51:59, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > [...]
> > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > index eda7e3a0b67c..361c4174e222 100644
> > > --- a/mm/hugetlb_vmemmap.c
> > > +++ b/mm/hugetlb_vmemmap.c
> > > @@ -117,6 +117,8 @@
> > >  #define RESERVE_VMEMMAP_NR           2U
> > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > >  #define TAIL_PAGE_REUSE                      -1
> > > +#define GFP_VMEMMAP_PAGE             \
> > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> >
> > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > depletion. I am not even sure triggering the OOM killer is a reasonable
> > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > destructive side effects. I believe it would be more reasonable to
> > simply refuse to shrink the pool if we cannot free those pages up. This
> > sucks as well but it isn't destructive at least.
> 
> I found the description of __GFP_MEMALLOC in the kernel doc.
> 
> %__GFP_MEMALLOC allows access to all memory. This should only be used when
> the caller guarantees the allocation will allow more memory to be freed
> very shortly.
> 
> Our situation is in line with the description above: we will free a HugeTLB
> page to the buddy allocator, which is much larger than what we allocate
> shortly beforehand.

Yes that is a part of the description. But read it in its entirety.
 * %__GFP_MEMALLOC allows access to all memory. This should only be used when
 * the caller guarantees the allocation will allow more memory to be freed
 * very shortly e.g. process exiting or swapping. Users either should
 * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
 * Users of this flag have to be extremely careful to not deplete the reserve
 * completely and implement a throttling mechanism which controls the
 * consumption of the reserve based on the amount of freed memory.
 * Usage of a pre-allocated pool (e.g. mempool) should be always considered
 * before using this flag.

GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH

sounds like a more reasonable fit to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd
  2020-11-20  8:16   ` Michal Hocko
@ 2020-11-20  9:30     ` Muchun Song
  2020-11-23  7:48       ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  9:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:16 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:17, Muchun Song wrote:
> > When we allocate a hugetlb page from buddy, we may need to split a huge
> > pmd into ptes. When we free the hugetlb page, we can merge the ptes back
> > into a pmd. So we need to distinguish whether the previous pmd has been
> > split. The page table is not allocated from slab, so we can reuse PG_slab
> > to indicate that the pmd has been split.
>
> PageSlab is used outside of the slab allocator proper and that code
> might get confused by this AFAICS.

I see your concern. Maybe we can use PG_private instead of
PG_slab.

>
> From the above description it is not really clear why this is needed
> though. Who is supposed to use this? Say you are allocating a fresh
> hugetlb page. Once you have it, nobody else can be interfering. It is
> exclusive to the caller. The later machinery can check the vmemmap page
> tables to find out whether a split is needed or not. Or do I miss
> something?

Yeah, the commit log needs some improvement. The vmemmap pages can use a
huge page mapping or a base page (e.g. 4KB) mapping, and these two cases
may exist at the same time, so we need to know which page size the
vmemmap pages are mapped with. If we have split a PMD page table, we set
the flag; when we free the HugeTLB page and the flag is set, we merge the
PTE page table back into a PMD. If the flag is not set, we leave the PTE
page table alone.

Thanks.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb_vmemmap.c | 26 ++++++++++++++++++++++++--
> >  1 file changed, 24 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 06e2b8a7b7c8..e2ddc73ce25f 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -293,6 +293,25 @@ static void remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
> >       flush_tlb_kernel_range(start, end);
> >  }
> >
> > +static inline bool pmd_split(pmd_t *pmd)
> > +{
> > +     return PageSlab(pmd_page(*pmd));
> > +}
> > +
> > +static inline void set_pmd_split(pmd_t *pmd)
> > +{
> > +     /*
> > +      * We should not use slab for page table allocation. So we can set
> > +      * PG_slab to indicate that the pmd has been split.
> > +      */
> > +     __SetPageSlab(pmd_page(*pmd));
> > +}
> > +
> > +static inline void clear_pmd_split(pmd_t *pmd)
> > +{
> > +     __ClearPageSlab(pmd_page(*pmd));
> > +}
> > +
> >  static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
> >                                         unsigned long start,
> >                                         unsigned long end,
> > @@ -357,11 +376,12 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
> >       ptl = vmemmap_pmd_lock(pmd);
> >       remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &remap_pages,
> >                                   __remap_huge_page_pte_vmemmap);
> > -     if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
> > +     if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
> >               /*
> >                * Todo:
> >                * Merge pte to huge pmd if it has ever been split.
> >                */
> > +             clear_pmd_split(pmd);
> >       }
> >       spin_unlock(ptl);
> >  }
> > @@ -443,8 +463,10 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> >       BUG_ON(!pmd);
> >
> >       ptl = vmemmap_pmd_lock(pmd);
> > -     if (vmemmap_pmd_huge(pmd))
> > +     if (vmemmap_pmd_huge(pmd)) {
> >               split_vmemmap_huge_page(head, pmd);
> > +             set_pmd_split(pmd);
> > +     }
> >
> >       remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head, &free_pages,
> >                                   __free_huge_page_pte_vmemmap);
> > --
> > 2.11.0
> >
>
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20  9:28       ` Michal Hocko
@ 2020-11-20  9:37         ` Muchun Song
  2020-11-20 11:10           ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > [...]
> > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > index eda7e3a0b67c..361c4174e222 100644
> > > > --- a/mm/hugetlb_vmemmap.c
> > > > +++ b/mm/hugetlb_vmemmap.c
> > > > @@ -117,6 +117,8 @@
> > > >  #define RESERVE_VMEMMAP_NR           2U
> > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > >  #define TAIL_PAGE_REUSE                      -1
> > > > +#define GFP_VMEMMAP_PAGE             \
> > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > >
> > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > destructive side effects. I believe it would be more reasonable to
> > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > sucks as well but it isn't destructive at least.
> >
> > I found the description of __GFP_MEMALLOC in the kernel doc.
> >
> > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > the caller guarantees the allocation will allow more memory to be freed
> > very shortly.
> >
> > Our situation is in line with the description above: we will free a HugeTLB
> > page to the buddy allocator, which is much larger than what we allocate
> > shortly beforehand.
>
> Yes that is a part of the description. But read it in its entirety.
>  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
>  * the caller guarantees the allocation will allow more memory to be freed
>  * very shortly e.g. process exiting or swapping. Users either should
>  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>  * Users of this flag have to be extremely careful to not deplete the reserve
>  * completely and implement a throttling mechanism which controls the
>  * consumption of the reserve based on the amount of freed memory.
>  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
>  * before using this flag.
>
> GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH

We want to free the HugeTLB page to the buddy allocator, but before that,
we need to allocate some pages as vmemmap pages, so we cannot handle
allocation failures here. I think that we should replace
__GFP_RETRY_MAYFAIL with __GFP_NOFAIL:

GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH

This meets our needs here. Thanks.

>
> sounds like a more reasonable fit to me.
>
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20  9:27   ` David Hildenbrand
@ 2020-11-20  9:39     ` Michal Hocko
  2020-11-20  9:43       ` David Hildenbrand
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20  9:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Muchun Song, corbet, mike.kravetz, tglx, mingo, bp, x86, hpa,
	dave.hansen, luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> On 20.11.20 09:42, Michal Hocko wrote:
> > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > [...]
> > 
> > Thanks for improving the cover letter and providing some numbers. I have
> > only glanced through the patchset because I didn't really have more time
> > to dive deeply into it.
> > 
> > Overall it looks promising. To summarize: I would prefer to not have
> > the feature enablement controlled by compile time option and the kernel
> > command line option should be opt-in. I also do not like that freeing
> > the pool can trigger the oom killer or even shut the system down if no
> > oom victim is eligible.
> > 
> > One thing that I didn't really get to think hard about is what is the
> > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > invalid when racing with the split. How do we enforce that this won't
> > blow up?
> 
> I have the same concerns - the sections are online the whole time and
> anybody with pfn_to_online_page() can grab them
> 
> I think we have similar issues with memory offlining when removing the
> vmemmap, it's just very hard to trigger and we can easily protect by
> grabbing the memhotplug lock.

I am not sure we can/want to extend memory hotplug locking to all pfn
walkers. But you are right that the underlying problem is similar but
much harder to trigger because vmemmaps are only removed when the
physical memory is hotremoved and that happens very seldom. Maybe it
will happen more with virtualization usecases. But this work makes it
even more tricky. If a pfn walker races with a hotremove then it would
just blow up when accessing the unmapped physical address space. For
this feature a pfn walker would just grab a real struct page re-used for
some unpredictable use under its feet. Any failure would be silent and
hard to debug.

[...]
> To keep things easy, maybe simply never allow freeing these hugetlb pages
> again for now? If they were reserved during boot and the vmemmap condensed,
> then just let them stick around for all eternity.

Not sure I understand. Do you propose to only free those vmemmap pages
when the pool is initialized during boot time and never allow them to be
freed up? That would certainly make it safer and maybe even simpler wrt
implementation.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20  9:39     ` Michal Hocko
@ 2020-11-20  9:43       ` David Hildenbrand
  2020-11-20 17:45         ` Mike Kravetz
  0 siblings, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2020-11-20  9:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Muchun Song, corbet, mike.kravetz, tglx, mingo, bp, x86, hpa,
	dave.hansen, luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 20.11.20 10:39, Michal Hocko wrote:
> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
>> On 20.11.20 09:42, Michal Hocko wrote:
>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
>>> [...]
>>>
>>> Thanks for improving the cover letter and providing some numbers. I have
>>> only glanced through the patchset because I didn't really have more time
>>> to dive deeply into it.
>>>
>>> Overall it looks promising. To summarize: I would prefer to not have
>>> the feature enablement controlled by compile time option and the kernel
>>> command line option should be opt-in. I also do not like that freeing
>>> the pool can trigger the oom killer or even shut the system down if no
>>> oom victim is eligible.
>>>
>>> One thing that I didn't really get to think hard about is what is the
>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>>> invalid when racing with the split. How do we enforce that this won't
>>> blow up?
>>
>> I have the same concerns - the sections are online the whole time and
>> anybody with pfn_to_online_page() can grab them
>>
>> I think we have similar issues with memory offlining when removing the
>> vmemmap, it's just very hard to trigger and we can easily protect by
>> grabbing the memhotplug lock.
> 
> I am not sure we can/want to extend memory hotplug locking to all pfn
> walkers. But you are right that the underlying problem is similar but
> much harder to trigger because vmemmaps are only removed when the
> physical memory is hotremoved and that happens very seldom. Maybe it
> will happen more with virtualization usecases. But this work makes it
> even more tricky. If a pfn walker races with a hotremove then it would
> just blow up when accessing the unmapped physical address space. For
> this feature a pfn walker would just grab a real struct page re-used for
> some unpredictable use under its feet. Any failure would be silent and
> hard to debug.

Right, we don't want the memory hotplug locking, thus discussions 
regarding rcu. Luckily, so far I have never seen a BUG report regarding this
- maybe because the time between memory offlining (offline_pages()) and 
memory/vmemmap getting removed (try_remove_memory()) is just too long. 
Someone would have to sleep after pfn_to_online_page() for quite a while 
to trigger it.

> 
> [...]
>> To keep things easy, maybe simply never allow freeing these hugetlb pages
>> again for now? If they were reserved during boot and the vmemmap condensed,
>> then just let them stick around for all eternity.
> 
> Not sure I understand. Do you propose to only free those vmemmap pages
> when the pool is initialized during boot time and never allow them to be
> freed up? That would certainly make it safer and maybe even simpler wrt
> implementation.

Exactly, let's keep it simple for now. I guess most use cases of this 
(virtualization, databases, ...) will allocate hugepages during boot and 
never free them.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page
  2020-11-20  8:19   ` Michal Hocko
@ 2020-11-20 10:32     ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:19 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:19, Muchun Song wrote:
> > Because we reuse the first tail page, if we set PageHWPoison on a
> > tail page, it indicates that we may set PageHWPoison on a series
> > of pages. So we can use head[4].mapping to record the real
> > error page index and set the raw error page PageHWPoison later.
>
> This really begs more explanation. Maybe I misremember, but if there
> is a HWPoison hole in a hugepage then the whole page is demolished, no?
> If that is the case then why do we care about tail pages?

It seems that I should make the commit log clearer. If there is
a HWPoison hole in a HugeTLB page, we should dissolve the HugeTLB page.
That means we set HWPoison on the raw error page (not the head
page) and free the HugeTLB page to the buddy allocator. Then we remove
only that one HWPoison page from the buddy free list. See
take_page_off_buddy() for more details. Thanks.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb.c         | 11 +++--------
> >  mm/hugetlb_vmemmap.h | 39 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 42 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 055604d07046..b853aacd5c16 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1383,6 +1383,7 @@ static void __free_hugepage(struct hstate *h, struct page *page)
> >       int i;
> >
> >       alloc_huge_page_vmemmap(h, page);
> > +     subpage_hwpoison_deliver(page);
> >
> >       for (i = 0; i < pages_per_huge_page(h); i++) {
> >               page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
> > @@ -1944,14 +1945,8 @@ int dissolve_free_huge_page(struct page *page)
> >               int nid = page_to_nid(head);
> >               if (h->free_huge_pages - h->resv_huge_pages == 0)
> >                       goto out;
> > -             /*
> > -              * Move PageHWPoison flag from head page to the raw error page,
> > -              * which makes any subpages rather than the error page reusable.
> > -              */
> > -             if (PageHWPoison(head) && page != head) {
> > -                     SetPageHWPoison(page);
> > -                     ClearPageHWPoison(head);
> > -             }
> > +
> > +             set_subpage_hwpoison(head, page);
> >               list_del(&head->lru);
> >               h->free_huge_pages--;
> >               h->free_huge_pages_node[nid]--;
> > diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> > index 779d3cb9333f..65e94436ffff 100644
> > --- a/mm/hugetlb_vmemmap.h
> > +++ b/mm/hugetlb_vmemmap.h
> > @@ -20,6 +20,29 @@ void __init gather_vmemmap_pgtable_init(struct huge_bootmem_page *m,
> >  void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
> >  void free_huge_page_vmemmap(struct hstate *h, struct page *head);
> >
> > +static inline void subpage_hwpoison_deliver(struct page *head)
> > +{
> > +     struct page *page = head;
> > +
> > +     if (PageHWPoison(head))
> > +             page = head + page_private(head + 4);
> > +
> > +     /*
> > +      * Move PageHWPoison flag from head page to the raw error page,
> > +      * which makes any subpages rather than the error page reusable.
> > +      */
> > +     if (page != head) {
> > +             SetPageHWPoison(page);
> > +             ClearPageHWPoison(head);
> > +     }
> > +}
> > +
> > +static inline void set_subpage_hwpoison(struct page *head, struct page *page)
> > +{
> > +     if (PageHWPoison(head))
> > +             set_page_private(head + 4, page - head);
> > +}
> > +
> >  static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
> >  {
> >       return h->nr_free_vmemmap_pages;
> > @@ -56,6 +79,22 @@ static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> >  {
> >  }
> >
> > +static inline void subpage_hwpoison_deliver(struct page *head)
> > +{
> > +}
> > +
> > +static inline void set_subpage_hwpoison(struct page *head, struct page *page)
> > +{
> > +     /*
> > +      * Move PageHWPoison flag from head page to the raw error page,
> > +      * which makes any subpages rather than the error page reusable.
> > +      */
> > +     if (PageHWPoison(head) && page != head) {
> > +             SetPageHWPoison(page);
> > +             ClearPageHWPoison(head);
> > +     }
> > +}
> > +
> >  static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
> >  {
> >       return 0;
> > --
> > 2.11.0
> >
>
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  2020-11-20  8:22   ` Michal Hocko
@ 2020-11-20 10:39     ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20 10:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:22 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:21, Muchun Song wrote:
> > Add a kernel parameter hugetlb_free_vmemmap to disable the feature of
> > freeing unused vmemmap pages associated with each hugetlb page on boot.
>
> As replied to the config patch. This is fine but I would argue that the
> default should be flipped. Saving memory is nice but it comes with
> overhead and therefore should be an opt-in. The config option should
> only guard compile time dependencies not a user choice.

Got it. The default will be flipped in the next version. Thanks.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  Documentation/admin-guide/kernel-parameters.txt |  9 +++++++++
> >  Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +++
> >  mm/hugetlb_vmemmap.c                            | 21 +++++++++++++++++++++
> >  3 files changed, 33 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 5debfe238027..ccf07293cb63 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1551,6 +1551,15 @@
> >                       Documentation/admin-guide/mm/hugetlbpage.rst.
> >                       Format: size[KMG]
> >
> > +     hugetlb_free_vmemmap=
> > +                     [KNL] When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set,
> > +                     this controls freeing unused vmemmap pages associated
> > +                     with each HugeTLB page.
> > +                     Format: { on (default) | off }
> > +
> > +                     on:  enable the feature
> > +                     off: disable the feature
> > +
> >       hung_task_panic=
> >                       [KNL] Should the hung task detector generate panics.
> >                       Format: 0 | 1
> > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> > index f7b1c7462991..7d6129ee97dd 100644
> > --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> > @@ -145,6 +145,9 @@ default_hugepagesz
> >
> >       will all result in 256 2M huge pages being allocated.  Valid default
> >       huge page size is architecture dependent.
> > +hugetlb_free_vmemmap
> > +     When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this disables freeing
> > +     unused vmemmap pages associated with each HugeTLB page.
> >
> >  When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
> >  indicates the current number of pre-allocated huge pages of the default size.
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 3629165d8158..c958699d1393 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -144,6 +144,22 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
> >  }
> >  #endif
> >
> > +static bool hugetlb_free_vmemmap_disabled __initdata;
> > +
> > +static int __init early_hugetlb_free_vmemmap_param(char *buf)
> > +{
> > +     if (!buf)
> > +             return -EINVAL;
> > +
> > +     if (!strcmp(buf, "off"))
> > +             hugetlb_free_vmemmap_disabled = true;
> > +     else if (strcmp(buf, "on"))
> > +             return -EINVAL;
> > +
> > +     return 0;
> > +}
> > +early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
> > +
> >  static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
> >  {
> >       return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
> > @@ -541,6 +557,11 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
> >       unsigned int order = huge_page_order(h);
> >       unsigned int vmemmap_pages;
> >
> > +     if (hugetlb_free_vmemmap_disabled) {
> > +             pr_info("disable free vmemmap pages for %s\n", h->name);
> > +             return;
> > +     }
> > +
> >       vmemmap_pages = ((1 << order) * sizeof(struct page)) >> PAGE_SHIFT;
> >       /*
> >        * The head page and the first tail page are not to be freed to buddy
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page
  2020-11-20  8:23   ` Michal Hocko
@ 2020-11-20 10:41     ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20 10:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:24 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:22, Muchun Song wrote:
> > Merge the ptes back into a huge pmd if it has ever been split. For now,
> > only gigantic pages whose vmemmap size is an integer multiple of
> > PMD_SIZE are supported. This is the simplest case to handle.
>
> I think it would be benefitial for anybody who plan to implement this
> for normal PMDs to document challenges while you still have them fresh
> in your mind.

Yeah, I agree with you. I will document it.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  arch/x86/include/asm/hugetlb.h |   8 +++
> >  mm/hugetlb_vmemmap.c           | 118 ++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 124 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> > index c601fe042832..1de1c519a84a 100644
> > --- a/arch/x86/include/asm/hugetlb.h
> > +++ b/arch/x86/include/asm/hugetlb.h
> > @@ -12,6 +12,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
> >  {
> >       return pmd_large(*pmd);
> >  }
> > +
> > +#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
> > +static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
> > +{
> > +     pte_t entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL_LARGE);
> > +
> > +     return __pmd(pte_val(entry));
> > +}
> >  #endif
> >
> >  #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index c958699d1393..bf2b6b3e75af 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -144,6 +144,14 @@ static inline bool vmemmap_pmd_huge(pmd_t *pmd)
> >  }
> >  #endif
> >
> > +#ifndef vmemmap_pmd_mkhuge
> > +#define vmemmap_pmd_mkhuge vmemmap_pmd_mkhuge
> > +static inline pmd_t vmemmap_pmd_mkhuge(struct page *page)
> > +{
> > +     return pmd_mkhuge(mk_pmd(page, PAGE_KERNEL));
> > +}
> > +#endif
> > +
> >  static bool hugetlb_free_vmemmap_disabled __initdata;
> >
> >  static int __init early_hugetlb_free_vmemmap_param(char *buf)
> > @@ -422,6 +430,104 @@ static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
> >       }
> >  }
> >
> > +static void __replace_huge_page_pte_vmemmap(pte_t *ptep, unsigned long start,
> > +                                         unsigned int nr, struct page *huge,
> > +                                         struct list_head *free_pages)
> > +{
> > +     unsigned long addr;
> > +     unsigned long end = start + (nr << PAGE_SHIFT);
> > +     pgprot_t pgprot = PAGE_KERNEL;
> > +
> > +     for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
> > +             struct page *page;
> > +             pte_t old = *ptep;
> > +             pte_t entry;
> > +
> > +             prepare_vmemmap_page(huge);
> > +
> > +             entry = mk_pte(huge++, pgprot);
> > +             VM_WARN_ON(!pte_present(old));
> > +             page = pte_page(old);
> > +             list_add(&page->lru, free_pages);
> > +
> > +             set_pte_at(&init_mm, addr, ptep, entry);
> > +     }
> > +}
> > +
> > +static void replace_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
> > +                                       struct page *huge,
> > +                                       struct list_head *free_pages)
> > +{
> > +     unsigned long end = start + VMEMMAP_HPAGE_SIZE;
> > +
> > +     flush_cache_vunmap(start, end);
> > +     __replace_huge_page_pte_vmemmap(pte_offset_kernel(pmd, start), start,
> > +                                     VMEMMAP_HPAGE_NR, huge, free_pages);
> > +     flush_tlb_kernel_range(start, end);
> > +}
> > +
> > +static pte_t *merge_vmemmap_pte(pmd_t *pmdp, unsigned long addr)
> > +{
> > +     pte_t *pte;
> > +     struct page *page;
> > +
> > +     pte = pte_offset_kernel(pmdp, addr);
> > +     page = pte_page(*pte);
> > +     set_pmd(pmdp, vmemmap_pmd_mkhuge(page));
> > +
> > +     return pte;
> > +}
> > +
> > +static void merge_huge_page_pmd_vmemmap(pmd_t *pmd, unsigned long start,
> > +                                     struct page *huge,
> > +                                     struct list_head *free_pages)
> > +{
> > +     replace_huge_page_pmd_vmemmap(pmd, start, huge, free_pages);
> > +     pte_free_kernel(&init_mm, merge_vmemmap_pte(pmd, start));
> > +     flush_tlb_kernel_range(start, start + VMEMMAP_HPAGE_SIZE);
> > +}
> > +
> > +static inline void dissolve_compound_page(struct page *page, unsigned int order)
> > +{
> > +     int i;
> > +     unsigned int nr_pages = 1 << order;
> > +
> > +     for (i = 1; i < nr_pages; i++)
> > +             set_page_count(page + i, 1);
> > +}
> > +
> > +static void merge_gigantic_page_vmemmap(struct hstate *h, struct page *head,
> > +                                     pmd_t *pmd)
> > +{
> > +     LIST_HEAD(free_pages);
> > +     unsigned long addr = (unsigned long)head;
> > +     unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
> > +
> > +     for (; addr < end; addr += VMEMMAP_HPAGE_SIZE) {
> > +             void *to;
> > +             struct page *page;
> > +
> > +             page = alloc_pages(GFP_VMEMMAP_PAGE & ~__GFP_NOFAIL,
> > +                                VMEMMAP_HPAGE_ORDER);
> > +             if (!page)
> > +                     goto out;
> > +
> > +             dissolve_compound_page(page, VMEMMAP_HPAGE_ORDER);
> > +             to = page_to_virt(page);
> > +             memcpy(to, (void *)addr, VMEMMAP_HPAGE_SIZE);
> > +
> > +             /*
> > +              * Make sure that any data that writes to the
> > +              * @to is made visible to the physical page.
> > +              */
> > +             flush_kernel_vmap_range(to, VMEMMAP_HPAGE_SIZE);
> > +
> > +             merge_huge_page_pmd_vmemmap(pmd++, addr, page, &free_pages);
> > +     }
> > +out:
> > +     free_vmemmap_page_list(&free_pages);
> > +}
> > +
> >  static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
> >  {
> >       int i;
> > @@ -454,10 +560,18 @@ void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
> >                                   __remap_huge_page_pte_vmemmap);
> >       if (!freed_vmemmap_hpage_dec(pmd_page(*pmd)) && pmd_split(pmd)) {
> >               /*
> > -              * Todo:
> > -              * Merge pte to huge pmd if it has ever been split.
> > +              * Merge the ptes back into a huge pmd if it has ever been
> > +              * split. For now, only gigantic pages whose vmemmap size is
> > +              * an integer multiple of PMD_SIZE are supported. This is the
> > +              * simplest case to handle.
> >                */
> >               clear_pmd_split(pmd);
> > +
> > +             if (IS_ALIGNED(vmemmap_pages_per_hpage(h), VMEMMAP_HPAGE_NR)) {
> > +                     spin_unlock(ptl);
> > +                     merge_gigantic_page_vmemmap(h, head, pmd);
> > +                     return;
> > +             }
> >       }
> >       spin_unlock(ptl);
> >  }
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs



--
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  9:16   ` David Hildenbrand
@ 2020-11-20 10:42     ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20 10:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador, Michal Hocko,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 5:16 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.11.20 07:43, Muchun Song wrote:
> > We can only free the unused vmemmap pages to the buddy system when
> > the size of struct page is a power of two.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >   mm/hugetlb_vmemmap.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index c3b3fc041903..7bb749a3eea2 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -671,7 +671,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
> >       unsigned int order = huge_page_order(h);
> >       unsigned int vmemmap_pages;
> >
> > -     if (hugetlb_free_vmemmap_disabled) {
> > +     if (hugetlb_free_vmemmap_disabled ||
> > +         !is_power_of_2(sizeof(struct page))) {
> >               pr_info("disable free vmemmap pages for %s\n", h->name);
> >               return;
> >       }
> >
>
> This patch should be merged into the original patch that introduced
> vmemmap freeing.

Oh, yeah. Will do.

>
> --
> Thanks,
>
> David / dhildenb
>


-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20  9:37         ` Muchun Song
@ 2020-11-20 11:10           ` Michal Hocko
  2020-11-20 11:56             ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20 11:10 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri 20-11-20 17:37:09, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > > [...]
> > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > > index eda7e3a0b67c..361c4174e222 100644
> > > > > --- a/mm/hugetlb_vmemmap.c
> > > > > +++ b/mm/hugetlb_vmemmap.c
> > > > > @@ -117,6 +117,8 @@
> > > > >  #define RESERVE_VMEMMAP_NR           2U
> > > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > > >  #define TAIL_PAGE_REUSE                      -1
> > > > > +#define GFP_VMEMMAP_PAGE             \
> > > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > > >
> > > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > > destructive side effects. I believe it would be more reasonable to
> > > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > > sucks as well but it isn't destructive at least.
> > >
> > > I found the description of __GFP_MEMALLOC in the kernel doc.
> > >
> > > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > > the caller guarantees the allocation will allow more memory to be freed
> > > very shortly.
> > >
> > > Our situation matches the description above: we will free a HugeTLB page
> > > to the buddy allocator, which is much larger than the pages we allocate shortly before.
> >
> > Yes that is a part of the description. But read it in its full entirety.
> >  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
> >  * the caller guarantees the allocation will allow more memory to be freed
> >  * very shortly e.g. process exiting or swapping. Users either should
> >  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> >  * Users of this flag have to be extremely careful to not deplete the reserve
> >  * completely and implement a throttling mechanism which controls the
> >  * consumption of the reserve based on the amount of freed memory.
> >  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> >  * before using this flag.
> >
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH
> 
> We want to free the HugeTLB page to the buddy allocator, but before that,
> we need to allocate some pages as vmemmap pages, so here we cannot
> handle allocation failures.

Why cannot you simply refuse to shrink the pool size?

> I think that we should replace
> __GFP_RETRY_MAYFAIL with __GFP_NOFAIL.
> 
> GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH
> 
> This meets our needs here. Thanks.

Please read again my concern about the disruptive behavior or explain
why it is desirable.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
  2020-11-20 11:10           ` Michal Hocko
@ 2020-11-20 11:56             ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-20 11:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 7:10 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 17:37:09, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > > > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > > > [...]
> > > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > > > index eda7e3a0b67c..361c4174e222 100644
> > > > > > --- a/mm/hugetlb_vmemmap.c
> > > > > > +++ b/mm/hugetlb_vmemmap.c
> > > > > > @@ -117,6 +117,8 @@
> > > > > >  #define RESERVE_VMEMMAP_NR           2U
> > > > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > > > >  #define TAIL_PAGE_REUSE                      -1
> > > > > > +#define GFP_VMEMMAP_PAGE             \
> > > > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > > > >
> > > > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > > > destructive side effects. I believe it would be more reasonable to
> > > > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > > > sucks as well but it isn't destructive at least.
> > > >
> > > > I found the description of __GFP_MEMALLOC in the kernel doc.
> > > >
> > > > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > > > the caller guarantees the allocation will allow more memory to be freed
> > > > very shortly.
> > > >
> > > > Our situation matches the description above: we will free a HugeTLB page
> > > > to the buddy allocator, which is much larger than the pages we allocate shortly before.
> > >
> > > Yes that is a part of the description. But read it in its full entirety.
> > >  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > >  * the caller guarantees the allocation will allow more memory to be freed
> > >  * very shortly e.g. process exiting or swapping. Users either should
> > >  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> > >  * Users of this flag have to be extremely careful to not deplete the reserve
> > >  * completely and implement a throttling mechanism which controls the
> > >  * consumption of the reserve based on the amount of freed memory.
> > >  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> > >  * before using this flag.
> > >
> > > GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH
> >
> > We want to free the HugeTLB page to the buddy allocator, but before that,
> > we need to allocate some pages as vmemmap pages, so here we cannot
> > handle allocation failures.
>
> Why cannot you simply refuse to shrink the pool size?
>
> > I think that we should replace
> > __GFP_RETRY_MAYFAIL with __GFP_NOFAIL.
> >
> > GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH
> >
> > This meets our needs here. Thanks.
>
> Please read again my concern about the disruptive behavior or explain
> why it is desirable.

OK, I will come up with a solution that does not use
__GFP_NOFAIL. Thanks.

>
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20  8:42 ` [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Michal Hocko
  2020-11-20  9:27   ` David Hildenbrand
@ 2020-11-20 12:40   ` Muchun Song
  2020-11-20 13:11     ` Michal Hocko
  1 sibling, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20 12:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
>
> Thanks for improving the cover letter and providing some numbers. I have
> only glanced through the patchset because I didn't really have more time
> to dive depply into them.
>
> Overall it looks promising. To summarize: I would prefer not to have
> the feature enablement controlled by compile time option and the kernel
> command line option should be opt-in. I also do not like that freeing
> the pool can trigger the oom killer or even shut the system down if no
> oom victim is eligible.

Hi Michal,

I have replied to you about those questions on the other mail thread.

Thanks.

>
> One thing that I didn't really get to think hard about is what is the
> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> invalid when racing with the split. How do we enforce that this won't
> blow up?

This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
in that case, pfn_to_page works. The return value of
pfn_to_page is actually the address of the pfn's struct page.
I cannot figure out where the problem is. Can you describe the
problem in detail, please? Thanks.

>
> I have also asked in a previous version whether the vmemmap manipulation
> should be really unconditional. E.g. shortlived hugetlb pages allocated
> from the buddy allocator directly rather than for a pool. Maybe it
> should be restricted for the pool allocation as those are considered
> long term and therefore the overhead will be amortized and freeing path
> restrictions better understandable.

Yeah, I agree with you. This can be an optimization, and we can
add it to the todo list and implement it in the future. The patch
series is already huge as it is.

>
> >  Documentation/admin-guide/kernel-parameters.txt |   9 +
> >  Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
> >  arch/x86/include/asm/hugetlb.h                  |  17 +
> >  arch/x86/include/asm/pgtable_64_types.h         |   8 +
> >  arch/x86/mm/init_64.c                           |   7 +-
> >  fs/Kconfig                                      |  14 +
> >  include/linux/bootmem_info.h                    |  78 +++
> >  include/linux/hugetlb.h                         |  19 +
> >  include/linux/hugetlb_cgroup.h                  |  15 +-
> >  include/linux/memory_hotplug.h                  |  27 -
> >  mm/Makefile                                     |   2 +
> >  mm/bootmem_info.c                               | 124 ++++
> >  mm/hugetlb.c                                    | 163 ++++-
> >  mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
> >  mm/hugetlb_vmemmap.h                            | 103 ++++
>
> I will need to look closer but I suspect that a non-trivial part of the
> vmemmap manipulation really belongs to mm/sparse-vmemmap.c because the
> split and remapping shouldn't really be hugetlb specific. Sure hugetlb
> knows how to split but all the splitting should be implemented in
> vmemmap proper.
>
> >  mm/memory_hotplug.c                             | 116 ----
> >  mm/sparse.c                                     |   5 +-
> >  17 files changed, 1295 insertions(+), 180 deletions(-)
> >  create mode 100644 include/linux/bootmem_info.h
> >  create mode 100644 mm/bootmem_info.c
> >  create mode 100644 mm/hugetlb_vmemmap.c
> >  create mode 100644 mm/hugetlb_vmemmap.h
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20 12:40   ` [External] " Muchun Song
@ 2020-11-20 13:11     ` Michal Hocko
  2020-11-20 15:44       ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-20 13:11 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri 20-11-20 20:40:46, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > [...]
> >
> > Thanks for improving the cover letter and providing some numbers. I have
> > only glanced through the patchset because I didn't really have more time
> > to dive deeply into them.
> >
> > Overall it looks promising. To summarize: I would prefer to not have
> > the feature enablement controlled by compile time option and the kernel
> > command line option should be opt-in. I also do not like that freeing
> > the pool can trigger the oom killer or even shut the system down if no
> > oom victim is eligible.
> 
> Hi Michal,
> 
> I have replied to you about those questions on the other mail thread.
> 
> Thanks.
> 
> >
> > One thing that I didn't really get to think hard about is what is the
> > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > invalid when racing with the split. How do we enforce that this won't
> > blow up?
> 
> This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> in that case, pfn_to_page works. The return value of
> pfn_to_page is actually the address of its struct page.
> I cannot figure out where the problem is. Could you describe the
> problem in more detail, please? Thanks.

The struct page returned by pfn_to_page might become invalid right after it is
returned, because the vmemmap could get freed up and the respective memory
released to the page allocator and reused for something else. See?

> > I have also asked in a previous version whether the vmemmap manipulation
> > should be really unconditional. E.g. shortlived hugetlb pages allocated
> > from the buddy allocator directly rather than for a pool. Maybe it
> > should be restricted for the pool allocation as those are considered
> > long term and therefore the overhead will be amortized and freeing path
> > restrictions better understandable.
> 
> Yeah, I agree with you. This could be a future optimization; we can
> add it to the todo list and implement it later, since the patch
> series is already quite large.

Yes, the patchset is large, and the primary aim should be to reduce
functionality to make it smaller in the first incarnation, especially
where it is tricky to implement. Releasing vmemmap for sparse hugepages is
one of those things. Do you really need it for your use case?
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20 13:11     ` Michal Hocko
@ 2020-11-20 15:44       ` Muchun Song
  2020-11-23  7:40         ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-20 15:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > [...]
> > >
> > > Thanks for improving the cover letter and providing some numbers. I have
> > > only glanced through the patchset because I didn't really have more time
> > > to dive deeply into them.
> > >
> > > Overall it looks promising. To summarize: I would prefer to not have
> > > the feature enablement controlled by compile time option and the kernel
> > > command line option should be opt-in. I also do not like that freeing
> > > the pool can trigger the oom killer or even shut the system down if no
> > > oom victim is eligible.
> >
> > Hi Michal,
> >
> > I have replied to you about those questions on the other mail thread.
> >
> > Thanks.
> >
> > >
> > > One thing that I didn't really get to think hard about is what is the
> > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > invalid when racing with the split. How do we enforce that this won't
> > > blow up?
> >
> > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > in that case, pfn_to_page works. The return value of
> > pfn_to_page is actually the address of its struct page.
> > I cannot figure out where the problem is. Could you describe the
> > problem in more detail, please? Thanks.
>
> The struct page returned by pfn_to_page might become invalid right after it is
> returned, because the vmemmap could get freed up and the respective memory
> released to the page allocator and reused for something else. See?

If the HugeTLB page is already allocated from the buddy allocator,
can the struct page of the HugeTLB page be freed? Does this case exist?
If yes, how can we free the HugeTLB page back to the buddy allocator
(when we cannot access its struct page)?

>
> > > I have also asked in a previous version whether the vmemmap manipulation
> > > should be really unconditional. E.g. shortlived hugetlb pages allocated
> > > from the buddy allocator directly rather than for a pool. Maybe it
> > > should be restricted for the pool allocation as those are considered
> > > long term and therefore the overhead will be amortized and freeing path
> > > restrictions better understandable.
> >
> > Yeah, I agree with you. This could be a future optimization; we can
> > add it to the todo list and implement it later, since the patch
> > series is already quite large.
>
> Yes, the patchset is large, and the primary aim should be to reduce
> functionality to make it smaller in the first incarnation, especially
> where it is tricky to implement. Releasing vmemmap for sparse hugepages is
> one of those things. Do you really need it for your use case?
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20  9:43       ` David Hildenbrand
@ 2020-11-20 17:45         ` Mike Kravetz
  2020-11-20 18:00           ` David Hildenbrand
                             ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Mike Kravetz @ 2020-11-20 17:45 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Muchun Song, corbet, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 11/20/20 1:43 AM, David Hildenbrand wrote:
> On 20.11.20 10:39, Michal Hocko wrote:
>> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
>>> On 20.11.20 09:42, Michal Hocko wrote:
>>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
>>>> [...]
>>>>
>>>> Thanks for improving the cover letter and providing some numbers. I have
>>>> only glanced through the patchset because I didn't really have more time
>>>> to dive deeply into them.
>>>>
>>>> Overall it looks promising. To summarize: I would prefer to not have
>>>> the feature enablement controlled by compile time option and the kernel
>>>> command line option should be opt-in. I also do not like that freeing
>>>> the pool can trigger the oom killer or even shut the system down if no
>>>> oom victim is eligible.
>>>>
>>>> One thing that I didn't really get to think hard about is what is the
>>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>>>> invalid when racing with the split. How do we enforce that this won't
>>>> blow up?
>>>
>>> I have the same concerns - the sections are online the whole time and
>>> anybody with pfn_to_online_page() can grab them.
>>>
>>> I think we have similar issues with memory offlining when removing the
>>> vmemmap, it's just very hard to trigger and we can easily protect by
>>> grabbing the memhotplug lock.
>>
>> I am not sure we can/want to extend memory hotplug locking to all pfn
>> walkers. But you are right that the underlying problem is similar but
>> much harder to trigger because vmemmaps are only removed when the
>> physical memory is hotremoved and that happens very seldom. Maybe it
>> will happen more with virtualization usecases. But this work makes it
>> even more tricky. If a pfn walker races with a hotremove then it would
>> just blow up when accessing the unmapped physical address space. For
>> this feature a pfn walker would just grab a real struct page re-used for
>> some unpredictable use under its feet. Any failure would be silent and
>> hard to debug.
> 
> Right, we don't want the memory hotplug locking, thus the discussions regarding RCU. Luckily, so far I have never seen a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and the memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.
> 
>>
>> [...]
>>> To keep things easy, maybe simply never allow to free these hugetlb pages
>>> again for now? If they were reserved during boot and the vmemmap condensed,
>>> then just let them stick around for all eternity.
>>
>> Not sure I understand. Do you propose to only free those vmemmap pages
>> when the pool is initialized during boot time and never allow to free
>> them up? That would certainly make it safer and maybe even simpler wrt
>> implementation.
> 
> Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.

Not sure if I agree with that last statement.  Database and virtualization
use cases from my employer allocate hugetlb pages after boot.  It
is shortly after boot, but still not from the boot/kernel command line.

Somewhat related, but not exactly addressing this issue ...

One idea discussed in a previous patch set was to disable PMD/huge page
mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
of the complex code doing page table manipulation.  It does not address
the issue of struct page pages going away which is being discussed here,
but it could be a way to simplify the first version of this code.  If this
is going to be an 'opt in' feature as previously suggested, then eliminating
the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
sysadmins would only 'opt in' if they expect most of system memory to be used
by hugetlb pages.  We certainly have database and virtualization use cases
where this is true.
-- 
Mike Kravetz


* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20 17:45         ` Mike Kravetz
@ 2020-11-20 18:00           ` David Hildenbrand
  2020-11-22  7:29           ` [External] " Muchun Song
  2020-11-23  7:38           ` Michal Hocko
  2 siblings, 0 replies; 77+ messages in thread
From: David Hildenbrand @ 2020-11-20 18:00 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: Muchun Song, corbet, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, willy, osalvador, song.bao.hua,
	duanxiongchun, linux-doc, linux-kernel, linux-mm, linux-fsdevel

On 20.11.20 18:45, Mike Kravetz wrote:
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
>> On 20.11.20 10:39, Michal Hocko wrote:
>>> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
>>>> On 20.11.20 09:42, Michal Hocko wrote:
>>>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
>>>>> [...]
>>>>>
>>>>> Thanks for improving the cover letter and providing some numbers. I have
>>>>> only glanced through the patchset because I didn't really have more time
>>>>> to dive deeply into them.
>>>>>
>>>>> Overall it looks promising. To summarize: I would prefer to not have
>>>>> the feature enablement controlled by compile time option and the kernel
>>>>> command line option should be opt-in. I also do not like that freeing
>>>>> the pool can trigger the oom killer or even shut the system down if no
>>>>> oom victim is eligible.
>>>>>
>>>>> One thing that I didn't really get to think hard about is what is the
>>>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>>>>> invalid when racing with the split. How do we enforce that this won't
>>>>> blow up?
>>>>
>>>> I have the same concerns - the sections are online the whole time and
>>>> anybody with pfn_to_online_page() can grab them.
>>>>
>>>> I think we have similar issues with memory offlining when removing the
>>>> vmemmap, it's just very hard to trigger and we can easily protect by
>>>> grabbing the memhotplug lock.
>>>
>>> I am not sure we can/want to extend memory hotplug locking to all pfn
>>> walkers. But you are right that the underlying problem is similar but
>>> much harder to trigger because vmemmaps are only removed when the
>>> physical memory is hotremoved and that happens very seldom. Maybe it
>>> will happen more with virtualization usecases. But this work makes it
>>> even more tricky. If a pfn walker races with a hotremove then it would
>>> just blow up when accessing the unmapped physical address space. For
>>> this feature a pfn walker would just grab a real struct page re-used for
>>> some unpredictable use under its feet. Any failure would be silent and
>>> hard to debug.
>>
>> Right, we don't want the memory hotplug locking, thus the discussions regarding RCU. Luckily, so far I have never seen a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and the memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.
>>
>>>
>>> [...]
>>>> To keep things easy, maybe simply never allow to free these hugetlb pages
>>>> again for now? If they were reserved during boot and the vmemmap condensed,
>>>> then just let them stick around for all eternity.
>>>
>>> Not sure I understand. Do you propose to only free those vmemmap pages
>>> when the pool is initialized during boot time and never allow to free
>>> them up? That would certainly make it safer and maybe even simpler wrt
>>> implementation.
>>
>> Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
> 
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from the boot/kernel command line.

Right, but the ones that care about this optimization for now could be
converted, I assume? I mean, we are talking about "opt-in" from
sysadmins, so requiring them to specify a different cmdline parameter does
not sound too weird to me. And it should simplify a first version quite a
lot.

The more I think about this, the more I believe that doing these vmemmap
modifications after boot is very dangerous.

> 
> Somewhat related, but not exactly addressing this issue ...
> 
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simplify the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

It sounds like a hack to me, which does not fully solve the problem. But 
yeah, it's a simplification.

-- 
Thanks,

David / dhildenb



* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20 17:45         ` Mike Kravetz
  2020-11-20 18:00           ` David Hildenbrand
@ 2020-11-22  7:29           ` Muchun Song
  2020-11-23  7:38           ` Michal Hocko
  2 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-22  7:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Michal Hocko, Jonathan Corbet,
	Thomas Gleixner, mingo, bp, x86, hpa, dave.hansen, luto,
	Peter Zijlstra, viro, Andrew Morton, paulmck, mchehab+huawei,
	pawan.kumar.gupta, Randy Dunlap, oneukum, anshuman.khandual,
	jroedel, Mina Almasry, David Rientjes, Matthew Wilcox,
	Oscar Salvador, Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Sat, Nov 21, 2020 at 1:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> > On 20.11.20 10:39, Michal Hocko wrote:
> >> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> >>> On 20.11.20 09:42, Michal Hocko wrote:
> >>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> >>>> [...]
> >>>>
> >>>> Thanks for improving the cover letter and providing some numbers. I have
> >>>> only glanced through the patchset because I didn't really have more time
> >>>> to dive deeply into them.
> >>>>
> >>>> Overall it looks promising. To summarize: I would prefer to not have
> >>>> the feature enablement controlled by compile time option and the kernel
> >>>> command line option should be opt-in. I also do not like that freeing
> >>>> the pool can trigger the oom killer or even shut the system down if no
> >>>> oom victim is eligible.
> >>>>
> >>>> One thing that I didn't really get to think hard about is what is the
> >>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> >>>> invalid when racing with the split. How do we enforce that this won't
> >>>> blow up?
> >>>
> >>> I have the same concerns - the sections are online the whole time and
> >>> anybody with pfn_to_online_page() can grab them.
> >>>
> >>> I think we have similar issues with memory offlining when removing the
> >>> vmemmap, it's just very hard to trigger and we can easily protect by
> >>> grabbing the memhotplug lock.
> >>
> >> I am not sure we can/want to extend memory hotplug locking to all pfn
> >> walkers. But you are right that the underlying problem is similar but
> >> much harder to trigger because vmemmaps are only removed when the
> >> physical memory is hotremoved and that happens very seldom. Maybe it
> >> will happen more with virtualization usecases. But this work makes it
> >> even more tricky. If a pfn walker races with a hotremove then it would
> >> just blow up when accessing the unmapped physical address space. For
> >> this feature a pfn walker would just grab a real struct page re-used for
> >> some unpredictable use under its feet. Any failure would be silent and
> >> hard to debug.
> >
> > Right, we don't want the memory hotplug locking, thus the discussions regarding RCU. Luckily, so far I have never seen a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and the memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.
> >
> >>
> >> [...]
> >>> To keep things easy, maybe simply never allow to free these hugetlb pages
> >>> again for now? If they were reserved during boot and the vmemmap condensed,
> >>> then just let them stick around for all eternity.
> >>
> >> Not sure I understand. Do you propose to only free those vmemmap pages
> >> when the pool is initialized during boot time and never allow to free
> >> them up? That would certainly make it safer and maybe even simpler wrt
> >> implementation.
> >
> > Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
>
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from the boot/kernel command line.
>
> Somewhat related, but not exactly addressing this issue ...
>
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simplify the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

Hi Mike,

Yeah, I agree with you that the first version of this feature should be
simple. I can do that (disable PMD/huge page mapping of vmemmap)
in the next version of the patch. But I have another question: what is the
problem when struct page pages go away? I have not understood
the issues discussed here; I hope you can explain them to me. Thanks.

> --
> Mike Kravetz



-- 
Yours,
Muchun


* Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  9:15     ` David Hildenbrand
@ 2020-11-22 13:30       ` Mike Rapoport
  0 siblings, 0 replies; 77+ messages in thread
From: Mike Rapoport @ 2020-11-22 13:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Muchun Song, corbet, mike.kravetz, tglx, mingo, bp,
	x86, hpa, dave.hansen, luto, peterz, viro, akpm, paulmck,
	mchehab+huawei, pawan.kumar.gupta, rdunlap, oneukum,
	anshuman.khandual, jroedel, almasrymina, rientjes, willy,
	osalvador, song.bao.hua, duanxiongchun, linux-doc, linux-kernel,
	linux-mm, linux-fsdevel

On Fri, Nov 20, 2020 at 10:15:30AM +0100, David Hildenbrand wrote:
> On 20.11.20 09:25, Michal Hocko wrote:
> > On Fri 20-11-20 14:43:25, Muchun Song wrote:
> > > We can only free the unused vmemmap to the buddy system when the
> > > size of struct page is a power of two.
> > 
> > Can we actually have !power_of_2 struct pages?
> 
> AFAIK multiples of 8 bytes (56, 64, 72) are possible.

Or multiples of 4 for 32-bit (28, 32, 36). 
 
> -- 
> Thanks,
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-20  8:25   ` Michal Hocko
  2020-11-20  9:15     ` David Hildenbrand
@ 2020-11-22 19:00     ` Matthew Wilcox
  2020-11-23  3:14       ` [External] " Muchun Song
  1 sibling, 1 reply; 77+ messages in thread
From: Matthew Wilcox @ 2020-11-22 19:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Muchun Song, corbet, mike.kravetz, tglx, mingo, bp, x86, hpa,
	dave.hansen, luto, peterz, viro, akpm, paulmck, mchehab+huawei,
	pawan.kumar.gupta, rdunlap, oneukum, anshuman.khandual, jroedel,
	almasrymina, rientjes, osalvador, song.bao.hua, duanxiongchun,
	linux-doc, linux-kernel, linux-mm, linux-fsdevel

On Fri, Nov 20, 2020 at 09:25:52AM +0100, Michal Hocko wrote:
> On Fri 20-11-20 14:43:25, Muchun Song wrote:
> > We can only free the unused vmemmap to the buddy system when the
> > size of struct page is a power of two.
> 
> Can we actually have !power_of_2 struct pages?

Yes.  On x86-64, if you don't enable MEMCG, it's 56 bytes.  On SPARC64,
if you do enable MEMCG, it's 72 bytes.  On 32-bit systems, it's
anything from 32-44 bytes, depending on MEMCG, WANT_PAGE_VIRTUAL and
LAST_CPUPID_NOT_IN_PAGE_FLAGS.



* Re: [External] Re: [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two
  2020-11-22 19:00     ` Matthew Wilcox
@ 2020-11-23  3:14       ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-23  3:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Jonathan Corbet, Mike Kravetz, Thomas Gleixner,
	mingo, bp, x86, hpa, dave.hansen, luto, Peter Zijlstra, viro,
	Andrew Morton, paulmck, mchehab+huawei, pawan.kumar.gupta,
	Randy Dunlap, oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Oscar Salvador, Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 3:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Nov 20, 2020 at 09:25:52AM +0100, Michal Hocko wrote:
> > On Fri 20-11-20 14:43:25, Muchun Song wrote:
> > > We can only free the unused vmemmap to the buddy system when the
> > > size of struct page is a power of two.
> >
> > Can we actually have !power_of_2 struct pages?
>
> Yes.  On x86-64, if you don't enable MEMCG, it's 56 bytes.  On SPARC64,
> if you do enable MEMCG, it's 72 bytes.  On 32-bit systems, it's
> anything from 32-44 bytes, depending on MEMCG, WANT_PAGE_VIRTUAL and
> LAST_CPUPID_NOT_IN_PAGE_FLAGS.
>

On x86-64, even if you do not enable MEMCG, it is still 64 bytes, because
CONFIG_HAVE_ALIGNED_STRUCT_PAGE is defined when we use SLUB.



-- 
Yours,
Muchun


* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20 17:45         ` Mike Kravetz
  2020-11-20 18:00           ` David Hildenbrand
  2020-11-22  7:29           ` [External] " Muchun Song
@ 2020-11-23  7:38           ` Michal Hocko
  2020-11-23 21:52             ` Mike Kravetz
  2 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-23  7:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, corbet, tglx, mingo, bp, x86,
	hpa, dave.hansen, luto, peterz, viro, akpm, paulmck,
	mchehab+huawei, pawan.kumar.gupta, rdunlap, oneukum,
	anshuman.khandual, jroedel, almasrymina, rientjes, willy,
	osalvador, song.bao.hua, duanxiongchun, linux-doc, linux-kernel,
	linux-mm, linux-fsdevel

On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
[...]
> >>> To keep things easy, maybe simply never allow to free these hugetlb pages
> >>> again for now? If they were reserved during boot and the vmemmap condensed,
> >>> then just let them stick around for all eternity.
> >>
> >> Not sure I understand. Do you propose to only free those vmemmap pages
> >> when the pool is initialized during boot time and never allow to free
> >> them up? That would certainly make it safer and maybe even simpler wrt
> >> implementation.
> > 
> > Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
> 
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from the boot/kernel command line.

Is there any strong reason for that?

> Somewhat related, but not exactly addressing this issue ...
> 
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simplify the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

Would this simplify the code considerably? I mean, the vmemmap page
tables will need to be updated anyway. So that code has to stay. PMD
entry split shouldn't be the most complex part of that operation.  On
the other hand, dropping large pages for all vmemmaps will likely have a
performance impact.
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-20 15:44       ` Muchun Song
@ 2020-11-23  7:40         ` Michal Hocko
  2020-11-23  8:53           ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-23  7:40 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri 20-11-20 23:44:26, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > [...]
> > > >
> > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > only glanced through the patchset because I didn't really have more time
> > > > to dive deeply into them.
> > > >
> > > > Overall it looks promising. To summarize: I would prefer to not have
> > > > the feature enablement controlled by compile time option and the kernel
> > > > command line option should be opt-in. I also do not like that freeing
> > > > the pool can trigger the oom killer or even shut the system down if no
> > > > oom victim is eligible.
> > >
> > > Hi Michal,
> > >
> > > I have replied to you about those questions on the other mail thread.
> > >
> > > Thanks.
> > >
> > > >
> > > > One thing that I didn't really get to think hard about is what is the
> > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > invalid when racing with the split. How do we enforce that this won't
> > > > blow up?
> > >
> > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > in that case, pfn_to_page works. The return value of
> > > pfn_to_page is actually the address of its struct page.
> > > I cannot figure out where the problem is. Could you describe the
> > > problem in more detail, please? Thanks.
> >
> > The struct page returned by pfn_to_page might become invalid right when
> > it is returned, because the vmemmap could get freed up and the respective
> > memory released to the page allocator and reused for something else. See?
> 
> If the HugeTLB page is already allocated from the buddy allocator,
> can the struct page of the HugeTLB page be freed? Can that happen?

Nope, struct pages only ever get deallocated when the respective memory
(they describe) is hotremoved via hotplug.

> If so, how would we free the HugeTLB page back to the buddy allocator
> (since we cannot access its struct page)?

But I do not follow how that relates to my concern above.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [External] Re: [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd
  2020-11-20  9:30     ` [External] " Muchun Song
@ 2020-11-23  7:48       ` Michal Hocko
  2020-11-23  8:01         ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-23  7:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Fri 20-11-20 17:30:27, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:16 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 14:43:17, Muchun Song wrote:
> > > When we allocate a hugetlb page from the buddy allocator, we may need
> > > to split a huge pmd into ptes. When we free the hugetlb page, we can
> > > merge the ptes back into a pmd. So we need to distinguish whether the
> > > pmd has been split. The page table is not allocated from slab, so we
> > > can reuse PG_slab to indicate that the pmd has been split.
> >
> > PageSlab is used outside of the slab allocator proper and that code
> > might get confused by this AFAICS.
> 
> I see your concern. Maybe we can use PG_private instead of PG_slab.

Reusing a page flag arbitrarily is not that easy. Hugetlb pages have a
lot of spare room in struct page so I would rather use something else.
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd
  2020-11-23  7:48       ` Michal Hocko
@ 2020-11-23  8:01         ` Muchun Song
  0 siblings, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-23  8:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 3:48 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 17:30:27, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 4:16 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > >
> > > PageSlab is used outside of the slab allocator proper and that code
> > > might get confused by this AFAICS.
> >
> > I see your concern. Maybe we can use PG_private instead of PG_slab.
>
> Reusing a page flag arbitrarily is not that easy. Hugetlb pages have a
> lot of spare room in struct page so I would rather use something else.

This page is the PMD page table page of the vmemmap, not a vmemmap
page of the HugeTLB page itself. And page table pages do not use
PG_private, so maybe that is enough. Thanks.

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23  7:40         ` Michal Hocko
@ 2020-11-23  8:53           ` Muchun Song
  2020-11-23  9:43             ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-23  8:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > >
> > > The struct page returned by pfn_to_page might become invalid right when
> > > it is returned, because the vmemmap could get freed up and the respective
> > > memory released to the page allocator and reused for something else. See?
> >
> > If the HugeTLB page is already allocated from the buddy allocator,
> > can the struct page of the HugeTLB page be freed? Can that happen?
>
> Nope, struct pages only ever get deallocated when the respective memory
> (they describe) is hotremoved via hotplug.
>
> > If so, how would we free the HugeTLB page back to the buddy allocator
> > (since we cannot access its struct page)?
>
> But I do not follow how that relates to my concern above.

Sorry, I didn't understand your concern.

vmemmap pages                 page frame
+-----------+   mapping to   +-----------+
|           | -------------> |     0     |
+-----------+                +-----------+
|           | -------------> |     1     |
+-----------+                +-----------+
|           | -------------> |     2     |
+-----------+                +-----------+
|           | -------------> |     3     |
+-----------+                +-----------+
|           | -------------> |     4     |
+-----------+                +-----------+
|           | -------------> |     5     |
+-----------+                +-----------+
|           | -------------> |     6     |
+-----------+                +-----------+
|           | -------------> |     7     |
+-----------+                +-----------+

In this patch series, we free page frames 2-7 back to the buddy
allocator. Do you mean that pfn_to_page can return an invalid value
when the pfn falls within page frames 2-7? Thanks.

>
> --
> Michal Hocko
> SUSE Labs



--
Yours,
Muchun


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23  8:53           ` Muchun Song
@ 2020-11-23  9:43             ` Michal Hocko
  2020-11-23 10:36               ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-23  9:43 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon 23-11-20 16:53:53, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
>
> In this patch series, we free page frames 2-7 back to the buddy
> allocator. Do you mean that pfn_to_page can return an invalid value
> when the pfn falls within page frames 2-7? Thanks.

No, I really mean that pfn_to_page will give you a struct page pointer
into pages which you release from the vmemmap page tables. Those pages
might get reused as soon as they are freed to the page allocator.
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23  9:43             ` Michal Hocko
@ 2020-11-23 10:36               ` Muchun Song
  2020-11-23 10:42                 ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-23 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > [...]
>
> No, I really mean that pfn_to_page will give you a struct page pointer
> into pages which you release from the vmemmap page tables. Those pages
> might get reused as soon as they are freed to the page allocator.

We remap vmemmap pages 2-7 (virtual addresses) to page frame 1 and
then free page frames 2-7 to the buddy allocator. Any access through
a struct page pointer returned by pfn_to_page is then served by page
frame 1, so I think there is no problem.

Thanks.

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 10:36               ` Muchun Song
@ 2020-11-23 10:42                 ` Michal Hocko
  2020-11-23 11:16                   ` Muchun Song
  2020-11-23 12:45                   ` Matthew Wilcox
  0 siblings, 2 replies; 77+ messages in thread
From: Michal Hocko @ 2020-11-23 10:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon 23-11-20 18:36:33, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >
> > No, I really mean that pfn_to_page will give you a struct page pointer
> > into pages which you release from the vmemmap page tables. Those pages
> > might get reused as soon as they are freed to the page allocator.
> 
> We remap vmemmap pages 2-7 (virtual addresses) to page frame 1 and
> then free page frames 2-7 to the buddy allocator.

And this doesn't really happen in an atomic fashion from the pfn walker
POV, right? So it is very well possible that 

struct page *page = pfn_to_page();
// remapping happens here
// page content is no longer valid because its backing memory can be
// reused for whatever purpose.
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 10:42                 ` Michal Hocko
@ 2020-11-23 11:16                   ` Muchun Song
  2020-11-23 11:32                     ` Michal Hocko
  2020-11-23 12:45                   ` Matthew Wilcox
  1 sibling, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-23 11:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > [...]
> >
> > We remap vmemmap pages 2-7 (virtual addresses) to page frame 1 and
> > then free page frames 2-7 to the buddy allocator.
>
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that

Yeah, you are right. But it may not be a problem for HugeTLB pages,
because in most cases we only read a tail struct page to get the head
struct page through compound_head() when the pfn is within a HugeTLB
range. Right?

>
> struct page *page = pfn_to_page();
> // remapping happens here
> // page content is no longer valid because its backing memory can be

If we only read page->compound_head, the content is still valid,
because the value of compound_head is the same for all the tail
struct pages of a HugeTLB page.

> // reused for whatever purpose.

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 11:16                   ` Muchun Song
@ 2020-11-23 11:32                     ` Michal Hocko
  2020-11-23 12:07                       ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-23 11:32 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon 23-11-20 19:16:18, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > > > to dive depply into them.
> > > > > > > > > >
> > > > > > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > > > oom victim is eligible.
> > > > > > > > >
> > > > > > > > > Hi Michal,
> > > > > > > > >
> > > > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > > > blow up?
> > > > > > > > >
> > > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > > > in this case, pfn_to_page() works. The return value of
> > > > > > > > > pfn_to_page() is actually the address of its struct page.
> > > > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > > > problem in detail please? Thanks.
> > > > > > > >
> > > > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > > > released to the page allocator and reused for something else. See?
> > > > > > >
> > > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > >
> > > > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > > > (they describe) is hotremoved via hotplug.
> > > > > >
> > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > (cannot access the struct page)?
> > > > > >
> > > > > > But I do not follow how that relates to my concern above.
> > > > >
> > > > > Sorry, I didn't understand your concerns.
> > > > >
> > > > > vmemmap pages                 page frame
> > > > > +-----------+   mapping to   +-----------+
> > > > > |           | -------------> |     0     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     1     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     2     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     3     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     4     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     5     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     6     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     7     |
> > > > > +-----------+                +-----------+
> > > > >
> > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > value when the pfn is the page frame 2-7? Thanks.
> > > >
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > >
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> >
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
> 
> Yeah, you are right. But it may not be a problem for HugeTLB pages.
> Because in most cases, we only read the tail struct page and get the
> head struct page through compound_head() when the pfn is within
> a HugeTLB range. Right?

Many pfn walkers would encounter the head page first and then skip over
the rest. Those should be reasonably safe. But there is no guarantee and
the fact that you need a valid page->compound_head which might get
scribbled over once you have the struct page makes this extremely
subtle.
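To make the subtlety concrete, the compound_head() encoding can be modeled
in user space. The sketch below is hypothetical Python, not the kernel's C
implementation: a tail page's compound_head field stores the head page's
address with bit 0 set, which is how a walker recovers the head from a tail.

```python
# Hypothetical user-space model of the kernel's compound_head() encoding.
# For tail pages, page->compound_head holds the head page's address | 1;
# head pages keep bit 0 clear.
def make_tail_field(head_addr):
    """Encode a tail page's compound_head field: head pointer with bit 0 set."""
    return head_addr | 1

def compound_head(page_addr, compound_head_field):
    """Return the head page address for either a head or a tail page."""
    if compound_head_field & 1:          # bit 0 set -> this is a tail page
        return compound_head_field - 1   # strip the flag to get the head
    return page_addr                     # head pages map to themselves

head = 0xffff000000001000
tail_field = make_tail_field(head)
assert compound_head(head, 0) == head                  # head page
assert compound_head(head + 0x40, tail_field) == head  # tail page
```

The concern above is exactly that the field this lookup depends on might be
scribbled over between obtaining the struct page and reading it.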

-- 
Michal Hocko
SUSE Labs

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 11:32                     ` Michal Hocko
@ 2020-11-23 12:07                       ` Muchun Song
  2020-11-23 12:18                         ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-23 12:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 19:16:18, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > >
> > > > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > > > > to dive deeply into them.
> > > > > > > > > > >
> > > > > > > > > > > Overall it looks promising. To summarize: I would prefer not to have
> > > > > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > > > > oom victim is eligible.
> > > > > > > > > >
> > > > > > > > > > Hi Michal,
> > > > > > > > > >
> > > > > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > > > > blow up?
> > > > > > > > > >
> > > > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > > > > in this case, pfn_to_page() works. The return value of
> > > > > > > > > > pfn_to_page() is actually the address of its struct page.
> > > > > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > > > > problem in detail please? Thanks.
> > > > > > > > >
> > > > > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > > > > released to the page allocator and reused for something else. See?
> > > > > > > >
> > > > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > > >
> > > > > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > > > > (they describe) is hotremoved via hotplug.
> > > > > > >
> > > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > > (cannot access the struct page)?
> > > > > > >
> > > > > > > But I do not follow how that relates to my concern above.
> > > > > >
> > > > > > Sorry, I didn't understand your concerns.
> > > > > >
> > > > > > vmemmap pages                 page frame
> > > > > > +-----------+   mapping to   +-----------+
> > > > > > |           | -------------> |     0     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     1     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     2     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     3     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     4     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     5     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     6     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     7     |
> > > > > > +-----------+                +-----------+
> > > > > >
> > > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > > value when the pfn is the page frame 2-7? Thanks.
> > > > >
> > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > might get reused as soon as they are freed to the page allocator.
> > > >
> > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > >
> > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > POV, right? So it is very well possible that
> >
> > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > Because in most cases, we only read the tail struct page and get the
> > head struct page through compound_head() when the pfn is within
> > a HugeTLB range. Right?
>
> Many pfn walkers would encounter the head page first and then skip over
> the rest. Those should be reasonably safe. But there is no guarantee and
> the fact that you need a valid page->compound_head which might get
> scribbled over once you have the struct page makes this extremely
> subtle.

In this patch series, we can guarantee that the page->compound_head
is always valid, because we reuse the first tail page. Maybe you need to
look closer at this series. Thanks.
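For scale, here is a back-of-envelope sketch of the savings discussed in
this series. It is illustrative arithmetic only (not kernel code), assuming
the x86-64 constants from the cover letter: 4KB base pages and a 64-byte
struct page.

```python
# Back-of-envelope model of vmemmap savings per HugeTLB page, assuming
# x86-64: 4KB base pages and 64-byte struct pages (illustrative only).
BASE_PAGE = 4096
STRUCT_PAGE_SIZE = 64

def vmemmap_pages(hugepage_size):
    """4KB vmemmap pages holding the struct pages of one HugeTLB page."""
    struct_pages = hugepage_size // BASE_PAGE   # one struct page per base page
    return struct_pages * STRUCT_PAGE_SIZE // BASE_PAGE

def freed_vmemmap_pages(hugepage_size):
    """Pages returned to buddy: all but the head page and the reused first tail."""
    return vmemmap_pages(hugepage_size) - 2

# 2MB HugeTLB page: 512 struct pages -> 8 vmemmap pages, 6 of which are freed
assert vmemmap_pages(2 << 20) == 8
assert freed_vmemmap_pages(2 << 20) == 6
# 1GB HugeTLB page: 262144 struct pages -> 4096 vmemmap pages
assert vmemmap_pages(1 << 30) == 4096
```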


>
> --
>
> SUSE Labs



--
Yours,
Muchun

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 12:07                       ` Muchun Song
@ 2020-11-23 12:18                         ` Michal Hocko
  2020-11-23 12:40                           ` Muchun Song
  0 siblings, 1 reply; 77+ messages in thread
From: Michal Hocko @ 2020-11-23 12:18 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon 23-11-20 20:07:23, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > > might get reused as soon as they are freed to the page allocator.
> > > > >
> > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > >
> > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > POV, right? So it is very well possible that
> > >
> > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > Because in most cases, we only read the tail struct page and get the
> > > head struct page through compound_head() when the pfn is within
> > > a HugeTLB range. Right?
> >
> > Many pfn walkers would encounter the head page first and then skip over
> > the rest. Those should be reasonably safe. But there is no guarantee and
> > the fact that you need a valid page->compound_head which might get
> > scribbled over once you have the struct page makes this extremely
> > subtle.
> 
> In this patch series, we can guarantee that the page->compound_head
> is always valid, because we reuse the first tail page. Maybe you need to
> look closer at this series. Thanks.

I must be really terrible at explaining my concern. Let me try one last
time. It is really _irrelevant_ what you do with tail pages. The
underlying problem is that you are changing struct pages under users
without any synchronization. What used to be a valid struct page will
turn into garbage as soon as you remap vmemmap page tables.
-- 
Michal Hocko
SUSE Labs

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 12:18                         ` Michal Hocko
@ 2020-11-23 12:40                           ` Muchun Song
  2020-11-23 12:48                             ` Michal Hocko
  0 siblings, 1 reply; 77+ messages in thread
From: Muchun Song @ 2020-11-23 12:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 20:07:23, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > > > might get reused as soon as they are freed to the page allocator.
> > > > > >
> > > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > > >
> > > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > > POV, right? So it is very well possible that
> > > >
> > > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > > Because in most cases, we only read the tail struct page and get the
> > > > head struct page through compound_head() when the pfn is within
> > > > a HugeTLB range. Right?
> > >
> > > Many pfn walkers would encounter the head page first and then skip over
> > > the rest. Those should be reasonably safe. But there is no guarantee and
> > > the fact that you need a valid page->compound_head which might get
> > > scribbled over once you have the struct page makes this extremely
> > > subtle.
> >
> > In this patch series, we can guarantee that the page->compound_head
> > is always valid, because we reuse the first tail page. Maybe you need to
> > look closer at this series. Thanks.
>
> I must be really terrible at explaining my concern. Let me try one last
> time. It is really _irrelevant_ what you do with tail pages. The
> underlying problem is that you are changing struct pages under users
> without any synchronization. What used to be a valid struct page will
> turn into garbage as soon as you remap vmemmap page tables.

Thank you very much for your patient explanation. So if the pfn walkers
always try to get the head struct page through compound_head() when they
encounter a tail struct page, there will be no concerns. Do you agree?

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 10:42                 ` Michal Hocko
  2020-11-23 11:16                   ` Muchun Song
@ 2020-11-23 12:45                   ` Matthew Wilcox
  2020-11-23 13:05                     ` Muchun Song
  2020-11-23 13:13                     ` Michal Hocko
  1 sibling, 2 replies; 77+ messages in thread
From: Matthew Wilcox @ 2020-11-23 12:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Muchun Song, Jonathan Corbet, Mike Kravetz, Thomas Gleixner,
	mingo, bp, x86, hpa, dave.hansen, luto, Peter Zijlstra, viro,
	Andrew Morton, paulmck, mchehab+huawei, pawan.kumar.gupta,
	Randy Dunlap, oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Oscar Salvador, Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > No I really mean that pfn_to_page will give you a struct page pointer
> > > from pages which you release from the vmemmap page tables. Those pages
> > > might get reused as soon as they are freed to the page allocator.
> > 
> > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > frame 1. And then we free page frame 2-7 to the buddy allocator.
> 
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that 
> 
> struct page *page = pfn_to_page();
> // remapping happens here
> // page content is no longer valid because its backing memory can be
> // reused for whatever purpose.

pfn_to_page() returns you a virtual address.  That virtual address
remains a valid pointer to exactly the same contents, it's just that
the page tables change to point to a different struct page which has
the same compound_head().
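This can be illustrated with a toy model (hypothetical names, not kernel
code): page tables are a dict mapping vmemmap virtual pages to physical
frames. After the remap, the same virtual address dereferences to a
different frame whose contents encode the same compound_head, so a racing
walker never reads garbage.

```python
# Toy model (hypothetical): vmemmap virtual pages 0-7, each initially
# backed by its own physical frame. Simplification: every frame stores
# the same encoded compound_head value.
frames = {f: {"compound_head": "head|1"} for f in range(8)}  # physical frames
page_tables = {va: va for va in range(8)}                    # vmemmap VA -> frame

def read_compound_head(va):
    """A pfn walker dereferencing a vmemmap virtual address."""
    return frames[page_tables[va]]["compound_head"]

before = read_compound_head(5)
# Remap vmemmap pages 2-7 onto frame 1; frames 2-7 may then be freed.
for va in range(2, 8):
    page_tables[va] = 1
for f in range(2, 8):
    frames[f] = {"compound_head": "garbage"}  # backing memory reused elsewhere
after = read_compound_head(5)
assert before == after == "head|1"  # the walker still reads a valid value
```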

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 12:40                           ` Muchun Song
@ 2020-11-23 12:48                             ` Michal Hocko
  0 siblings, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2020-11-23 12:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Matthew Wilcox, Oscar Salvador,
	Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon 23-11-20 20:40:40, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-11-20 20:07:23, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > > > > might get reused as soon as they are freed to the page allocator.
> > > > > > >
> > > > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > > > >
> > > > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > > > POV, right? So it is very well possible that
> > > > >
> > > > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > > > Because in most cases, we only read the tail struct page and get the
> > > > > head struct page through compound_head() when the pfn is within
> > > > > a HugeTLB range. Right?
> > > >
> > > > Many pfn walkers would encounter the head page first and then skip over
> > > > the rest. Those should be reasonably safe. But there is no guarantee and
> > > > the fact that you need a valid page->compound_head which might get
> > > > scribbled over once you have the struct page makes this extremely
> > > > subtle.
> > >
> > > In this patch series, we can guarantee that the page->compound_head
> > > is always valid, because we reuse the first tail page. Maybe you need to
> > > look closer at this series. Thanks.
> >
> > I must be really terrible at explaining my concern. Let me try one last
> > time. It is really _irrelevant_ what you do with tail pages. The
> > underlying problem is that you are changing struct pages under users
> > without any synchronization. What used to be a valid struct page will
> > turn into garbage as soon as you remap vmemmap page tables.
> 
> Thank you very much for your patient explanation. So if the pfn walkers
> always try get the head struct page through compound_head() when it
> encounter a tail struct page. There will be no concerns. Do you agree?

No, I do not agree. Please read again. The content of the struct page
might be complete garbage at any time after pfn_to_page returns a
struct page. So there is no valid compound_head anymore.
-- 
Michal Hocko
SUSE Labs

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 12:45                   ` Matthew Wilcox
@ 2020-11-23 13:05                     ` Muchun Song
  2020-11-23 13:13                     ` Michal Hocko
  1 sibling, 0 replies; 77+ messages in thread
From: Muchun Song @ 2020-11-23 13:05 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko
  Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo, bp, x86,
	hpa, dave.hansen, luto, Peter Zijlstra, viro, Andrew Morton,
	paulmck, mchehab+huawei, pawan.kumar.gupta, Randy Dunlap,
	oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Oscar Salvador, Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon, Nov 23, 2020 at 8:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > >
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> >
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
> >
> > struct page *page = pfn_to_page();
> > // remapping happens here
> > // page content is no longer valid because its backing memory can be
> > // reused for whatever purpose.
>
> pfn_to_page() returns you a virtual address.  That virtual address
> remains a valid pointer to exactly the same contents, it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

I agree with you.

Hi Michal,

Maybe you need to look at this.

-- 
Yours,
Muchun

* Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 12:45                   ` Matthew Wilcox
  2020-11-23 13:05                     ` Muchun Song
@ 2020-11-23 13:13                     ` Michal Hocko
  1 sibling, 0 replies; 77+ messages in thread
From: Michal Hocko @ 2020-11-23 13:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Muchun Song, Jonathan Corbet, Mike Kravetz, Thomas Gleixner,
	mingo, bp, x86, hpa, dave.hansen, luto, Peter Zijlstra, viro,
	Andrew Morton, paulmck, mchehab+huawei, pawan.kumar.gupta,
	Randy Dunlap, oneukum, anshuman.khandual, jroedel, Mina Almasry,
	David Rientjes, Oscar Salvador, Song Bao Hua (Barry Song),
	Xiongchun duan, linux-doc, LKML, Linux Memory Management List,
	linux-fsdevel

On Mon 23-11-20 12:45:13, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > > 
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > 
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that 
> > 
> > struct page *page = pfn_to_page();
> > // remapping happens here
> > // page content is no longer valid because its backing memory can be
> > // reused for whatever purpose.
> 
> pfn_to_page() returns you a virtual address.  That virtual address
> remains a valid pointer to exactly the same contents, it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

You are right. I have managed to completely confuse myself. Sorry about
the noise!

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23  7:38           ` Michal Hocko
@ 2020-11-23 21:52             ` Mike Kravetz
  2020-11-23 22:01               ` Matthew Wilcox
  0 siblings, 1 reply; 77+ messages in thread
From: Mike Kravetz @ 2020-11-23 21:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Muchun Song, corbet, tglx, mingo, bp, x86,
	hpa, dave.hansen, luto, peterz, viro, akpm, paulmck,
	mchehab+huawei, pawan.kumar.gupta, rdunlap, oneukum,
	anshuman.khandual, jroedel, almasrymina, rientjes, willy,
	osalvador, song.bao.hua, duanxiongchun, linux-doc, linux-kernel,
	linux-mm, linux-fsdevel

On 11/22/20 11:38 PM, Michal Hocko wrote:
> On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
>> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> [...]
>>>>> To keep things easy, maybe simply never allow to free these hugetlb pages
>>>>> again for now? If they were reserved during boot and the vmemmap condensed,
>>>>> then just let them stick around for all eternity.
>>>>
>>>> Not sure I understand. Do you propose to only free those vmemmap pages
>>>> when the pool is initialized during boot time and never allow to free
>>>> them up? That would certainly make it safer and maybe even simpler wrt
>>>> implementation.
>>>
>>> Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
>>
>> Not sure if I agree with that last statement.  Database and virtualization
>> use cases from my employer allocate hugetlb pages after boot.  It
>> is shortly after boot, but still not from boot/kernel command line.
> 
> Is there any strong reason for that?
> 

The reason I have been given is that it is preferable to have SW compute
the number of needed huge pages after boot based on total memory, rather
than have a sysadmin calculate the number and add a boot parameter.

>> Somewhat related, but not exactly addressing this issue ...
>>
>> One idea discussed in a previous patch set was to disable PMD/huge page
>> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
>> of the complex code doing page table manipulation.  It does not address
>> the issue of struct page pages going away which is being discussed here,
>> but it could be a way to simplify the first version of this code.  If this
>> is going to be an 'opt in' feature as previously suggested, then eliminating
>> the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
>> sysadmins would only 'opt in' if they expect most of system memory to be used
>> by hugetlb pages.  We certainly have database and virtualization use cases
>> where this is true.
> 
> Would this simplify the code considerably? I mean, the vmemmap page
> tables will need to be updated anyway. So that code has to stay. PMD
> entry split shouldn't be the most complex part of that operation.  On
> the other hand dropping large pages for all vmemmaps will likely have a
> performance impact.

I agree with your points.  This was just one way in which the patch set
could be simplified.
-- 
Mike Kravetz

* Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
  2020-11-23 21:52             ` Mike Kravetz
@ 2020-11-23 22:01               ` Matthew Wilcox
  0 siblings, 0 replies; 77+ messages in thread
From: Matthew Wilcox @ 2020-11-23 22:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, David Hildenbrand, Muchun Song, corbet, tglx,
	mingo, bp, x86, hpa, dave.hansen, luto, peterz, viro, akpm,
	paulmck, mchehab+huawei, pawan.kumar.gupta, rdunlap, oneukum,
	anshuman.khandual, jroedel, almasrymina, rientjes, osalvador,
	song.bao.hua, duanxiongchun, linux-doc, linux-kernel, linux-mm,
	linux-fsdevel

On Mon, Nov 23, 2020 at 01:52:13PM -0800, Mike Kravetz wrote:
> On 11/22/20 11:38 PM, Michal Hocko wrote:
> > On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
> >> Not sure if I agree with that last statement.  Database and virtualization
> >> use cases from my employer allocate hugetlb pages after boot.  It
> >> is shortly after boot, but still not from boot/kernel command line.
> > 
> > Is there any strong reason for that?
> 
> The reason I have been given is that it is preferable to have SW compute
> the number of needed huge pages after boot based on total memory, rather
> than have a sysadmin calculate the number and add a boot parameter.

Oh, I remember this bug!  I think it was posted publicly, even.
If the sysadmin configures, say, 90% of the RAM to be hugepages and
then a DIMM fails and the sysadmin doesn't remember to adjust the boot
parameter, Linux does some pretty horrible things and the symptom is
"Linux doesn't boot".


end of thread, other threads:[~2020-11-23 22:02 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-20  6:43 [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Muchun Song
2020-11-20  6:43 ` [PATCH v5 01/21] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c Muchun Song
2020-11-20  6:43 ` [PATCH v5 02/21] mm/memory_hotplug: Move {get,put}_page_bootmem() " Muchun Song
2020-11-20  6:43 ` [PATCH v5 03/21] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Muchun Song
2020-11-20  7:49   ` Michal Hocko
2020-11-20  8:35     ` [External] " Muchun Song
2020-11-20  8:47       ` Michal Hocko
2020-11-20  8:53         ` Muchun Song
2020-11-20  6:43 ` [PATCH v5 04/21] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate Muchun Song
2020-11-20  6:43 ` [PATCH v5 05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers Muchun Song
2020-11-20  6:43 ` [PATCH v5 06/21] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page() Muchun Song
2020-11-20  6:43 ` [PATCH v5 07/21] mm/bootmem_info: Combine bootmem info and type into page->freelist Muchun Song
2020-11-20  6:43 ` [PATCH v5 08/21] mm/hugetlb: Initialize page table lock for vmemmap Muchun Song
2020-11-20  6:43 ` [PATCH v5 09/21] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page Muchun Song
2020-11-20  6:43 ` [PATCH v5 10/21] mm/hugetlb: Defer freeing of hugetlb pages Muchun Song
2020-11-20  6:43 ` [PATCH v5 11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page Muchun Song
2020-11-20  8:11   ` Michal Hocko
2020-11-20  8:51     ` [External] " Muchun Song
2020-11-20  9:28       ` Michal Hocko
2020-11-20  9:37         ` Muchun Song
2020-11-20 11:10           ` Michal Hocko
2020-11-20 11:56             ` Muchun Song
2020-11-20  6:43 ` [PATCH v5 12/21] mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper Muchun Song
2020-11-20  6:43 ` [PATCH v5 13/21] mm/hugetlb: Use PG_slab to indicate split pmd Muchun Song
2020-11-20  8:16   ` Michal Hocko
2020-11-20  9:30     ` [External] " Muchun Song
2020-11-23  7:48       ` Michal Hocko
2020-11-23  8:01         ` Muchun Song
2020-11-20  6:43 ` [PATCH v5 14/21] mm/hugetlb: Support freeing vmemmap pages of gigantic page Muchun Song
2020-11-20  6:43 ` [PATCH v5 15/21] mm/hugetlb: Set the PageHWPoison to the raw error page Muchun Song
2020-11-20  8:19   ` Michal Hocko
2020-11-20 10:32     ` [External] " Muchun Song
2020-11-20  6:43 ` [PATCH v5 16/21] mm/hugetlb: Flush work when dissolving hugetlb page Muchun Song
2020-11-20  8:20   ` Michal Hocko
2020-11-20  6:43 ` [PATCH v5 17/21] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap Muchun Song
2020-11-20  8:22   ` Michal Hocko
2020-11-20 10:39     ` [External] " Muchun Song
2020-11-20  6:43 ` [PATCH v5 18/21] mm/hugetlb: Merge pte to huge pmd only for gigantic page Muchun Song
2020-11-20  8:23   ` Michal Hocko
2020-11-20 10:41     ` [External] " Muchun Song
2020-11-20  6:43 ` [PATCH v5 19/21] mm/hugetlb: Gather discrete indexes of tail page Muchun Song
2020-11-20  6:43 ` [PATCH v5 20/21] mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct page Muchun Song
2020-11-20  6:43 ` [PATCH v5 21/21] mm/hugetlb: Disable freeing vmemmap if struct page size is not power of two Muchun Song
2020-11-20  8:25   ` Michal Hocko
2020-11-20  9:15     ` David Hildenbrand
2020-11-22 13:30       ` Mike Rapoport
2020-11-22 19:00     ` Matthew Wilcox
2020-11-23  3:14       ` [External] " Muchun Song
2020-11-20  9:16   ` David Hildenbrand
2020-11-20 10:42     ` [External] " Muchun Song
2020-11-20  8:42 ` [PATCH v5 00/21] Free some vmemmap pages of hugetlb page Michal Hocko
2020-11-20  9:27   ` David Hildenbrand
2020-11-20  9:39     ` Michal Hocko
2020-11-20  9:43       ` David Hildenbrand
2020-11-20 17:45         ` Mike Kravetz
2020-11-20 18:00           ` David Hildenbrand
2020-11-22  7:29           ` [External] " Muchun Song
2020-11-23  7:38           ` Michal Hocko
2020-11-23 21:52             ` Mike Kravetz
2020-11-23 22:01               ` Matthew Wilcox
2020-11-20 12:40   ` [External] " Muchun Song
2020-11-20 13:11     ` Michal Hocko
2020-11-20 15:44       ` Muchun Song
2020-11-23  7:40         ` Michal Hocko
2020-11-23  8:53           ` Muchun Song
2020-11-23  9:43             ` Michal Hocko
2020-11-23 10:36               ` Muchun Song
2020-11-23 10:42                 ` Michal Hocko
2020-11-23 11:16                   ` Muchun Song
2020-11-23 11:32                     ` Michal Hocko
2020-11-23 12:07                       ` Muchun Song
2020-11-23 12:18                         ` Michal Hocko
2020-11-23 12:40                           ` Muchun Song
2020-11-23 12:48                             ` Michal Hocko
2020-11-23 12:45                   ` Matthew Wilcox
2020-11-23 13:05                     ` Muchun Song
2020-11-23 13:13                     ` Michal Hocko
