linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Allocate memmap from hotadded memory (per device)
@ 2020-12-17 13:07 Oscar Salvador
  2020-12-17 13:07 ` [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-12-17 13:07 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin,
	Oscar Salvador

I figured I would send a new version before going on vacation, so I can
work on it when I am back.

Changes from RFCv3 to Patchv1:
 - Addressed feedback from David
 - Re-order patches

Changes from v2 -> v3:
 - Re-order patches (Michal)
 - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
 - Add kernel boot option to enable this feature (Michal)

Changes from v1 -> v2:
 - Addressed feedback provided by David
 - Add a arch_support_memmap_on_memory to be called
   from mhp_supports_memmap_on_memory, as atm,
   only ARM, powerpc and x86_64 have altmat support.


Original cover letter:

----

The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
The current way we use to populate memmap (struct page array) has two main drawbacks:

a) it consumes an additional memory until the hotadded memory itself is
   onlined and
b) memmap might end up on a different numa node which is especially true
   for movable_node configuration.
c) due to fragmentation we might end up populating memmap with base
   pages

One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array. See patch 3 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

[Overall design]:

Implementation wise we reuse vmem_altmap infrastructure to override
the default allocator used by vmemap_populate.
memory_block structure gained a new field called nr_vmemmap_pages.
This plays well for two reasons:

 1) {offline/online}_pages know the difference between start_pfn and
    buddy_start_pfn, which is start_pfn + nr_vmemmap_pages.
    In this way all isolation/migration operations are
    done to within the right range of memory without vmemmap pages.
    This allows us for a much cleaner handling.

 2) In try_remove_memory, we construct a new vmemap_altmap struct with the
    right information based on memory_block->nr_vmemap_pages, so we end up
    calling vmem_altmap_free instead of free_pagetable when removing the memory.

Oscar Salvador (5):
  mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  mm,memory_hotplug: Allocate memmap from the added memory range
  acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  powerpc/memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory

 .../admin-guide/kernel-parameters.txt         |  14 ++
 arch/arm64/Kconfig                            |   4 +
 arch/powerpc/Kconfig                          |   4 +
 .../platforms/pseries/hotplug-memory.c        |   5 +-
 arch/x86/Kconfig                              |   4 +
 drivers/acpi/acpi_memhotplug.c                |   5 +-
 drivers/base/memory.c                         |  20 ++-
 include/linux/memory.h                        |   8 +-
 include/linux/memory_hotplug.h                |  21 ++-
 include/linux/memremap.h                      |   2 +-
 include/linux/mmzone.h                        |   5 +
 mm/Kconfig                                    |   3 +
 mm/Makefile                                   |   5 +-
 mm/memory_hotplug.c                           | 158 +++++++++++++++---
 mm/page_alloc.c                               |   4 +-
 15 files changed, 224 insertions(+), 38 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
@ 2020-12-17 13:07 ` Oscar Salvador
  2021-01-11 16:52   ` David Hildenbrand
  2020-12-17 13:07 ` [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2020-12-17 13:07 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin,
	Oscar Salvador

In order to use self-hosted memmap array, the platform needs to have
support for CONFIG_SPARSEMEM_VMEMMAP and altmap.
Currently, only arm64, PPC and x86_64 have the support, so enable those
platforms with ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.

ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE will be checked later on to see whether
we can enable this feature or not.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/Kconfig   | 4 ++++
 arch/powerpc/Kconfig | 4 ++++
 arch/x86/Kconfig     | 4 ++++
 mm/Kconfig           | 3 +++
 4 files changed, 15 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 33446269f692..2daa29ab9057 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -307,6 +307,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+	def_bool y
+	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
+
 config SMP
 	def_bool y
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..30322105f145 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -527,6 +527,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+        def_bool y
+        depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
+
 config PPC64_SUPPORTS_MEMORY_FAILURE
 	bool "Add support for memory hwpoison"
 	depends on PPC_BOOK3S_64
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 42d84d86f1f4..742b29169b47 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2443,6 +2443,10 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 	depends on MEMORY_HOTPLUG
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+	def_bool y
+	depends on X86_64 && MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
+
 config USE_PERCPU_NUMA_NODE_ID
 	def_bool y
 	depends on NUMA
diff --git a/mm/Kconfig b/mm/Kconfig
index 7af1a55b708e..6156114b7974 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -183,6 +183,9 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+	bool
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
  2020-12-17 13:07 ` [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
@ 2020-12-17 13:07 ` Oscar Salvador
  2021-01-18 10:11   ` Oscar Salvador
  2021-01-19 13:58   ` David Hildenbrand
  2020-12-17 13:07 ` [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-12-17 13:07 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin,
	Oscar Salvador

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) an existing memory is consumed for that purpose
    (eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.
 c) It might be there are no PMD_ALIGNED chunks so memmap array gets
    populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This comes in handy as in {online,offline}_pages, all the isolation and
migration is being done on (buddy_start_pfn, end_pfn] range,
being buddy_start_pfn = start_pfn + nr_vmemmap_pages.

In this way, we have:

(start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
(buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c          |  20 +++--
 include/linux/memory.h         |   8 +-
 include/linux/memory_hotplug.h |  21 ++++-
 include/linux/memremap.h       |   2 +-
 include/linux/mmzone.h         |   5 ++
 mm/memory_hotplug.c            | 138 ++++++++++++++++++++++++++++-----
 mm/page_alloc.c                |   4 +-
 7 files changed, 163 insertions(+), 35 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index eef4ffb6122c..7219e805dcaa 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -175,7 +175,7 @@ int memory_notify(unsigned long val, void *v)
  */
 static int
 memory_block_action(unsigned long start_section_nr, unsigned long action,
-		    int online_type, int nid)
+		    int online_type, int nid, unsigned long nr_vmemmap_pages)
 {
 	unsigned long start_pfn;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
 
 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, online_type, nid);
+		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
+				   online_type, nid);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages);
+		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
@@ -211,7 +212,7 @@ static int memory_block_change_state(struct memory_block *mem,
 		mem->state = MEM_GOING_OFFLINE;
 
 	ret = memory_block_action(mem->start_section_nr, to_state,
-				  mem->online_type, mem->nid);
+				  mem->online_type, mem->nid, mem->nr_vmemmap_pages);
 
 	mem->state = ret ? from_state_req : to_state;
 
@@ -571,7 +572,8 @@ int register_memory(struct memory_block *memory)
 	return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+			     unsigned long nr_vmemmap_pages)
 {
 	struct memory_block *mem;
 	unsigned long start_pfn;
@@ -591,6 +593,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state)
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 	mem->nid = NUMA_NO_NODE;
+	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
 	ret = register_memory(mem);
 
@@ -610,7 +613,7 @@ static int add_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -632,7 +635,8 @@ static void unregister_memory(struct memory_block *memory)
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -645,7 +649,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
 		if (ret)
 			break;
 	}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 439a89e758d8..f09d60ef8ad9 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -30,6 +30,11 @@ struct memory_block {
 	int phys_device;		/* to which fru does this belong? */
 	struct device dev;
 	int nid;			/* NID for this memory block */
+	unsigned long nr_vmemmap_pages;	/*
+					 * Number of vmemmap pages. These pages
+					 * lay at the beginning of the memory
+					 * block.
+					 */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -81,7 +86,8 @@ static inline int memory_notify(unsigned long val, void *v)
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 15acce5ab106..67ea4a0b25bd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -70,6 +70,14 @@ typedef int __bitwise mhp_t;
  */
 #define MEMHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -113,11 +121,13 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			int online_type, int nid);
+			unsigned long nr_vmemmap_pages, int online_type,
+			int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
-				     unsigned long end_pfn);
+				     unsigned long end_pfn,
+				     unsigned long buddy_start_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
@@ -312,7 +322,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			 unsigned long nr_vmemmap_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 extern int offline_and_remove_memory(int nid, u64 start, u64 size);
@@ -320,7 +331,8 @@ extern int offline_and_remove_memory(int nid, u64 start, u64 size);
 #else
 static inline void try_offline_node(int nid) {}
 
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+				unsigned long nr_vmemmap_pages)
 {
 	return -EINVAL;
 }
@@ -364,6 +376,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..3de5d482ac1a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,7 +17,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..d16b4411e422 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -407,6 +407,11 @@ enum zone_type {
 	 *    techniques might use alloc_contig_range() to hide previously
 	 *    exposed pages from the buddy again (e.g., to implement some sort
 	 *    of memory unplug in virtio-mem).
+	 * 6. Memory-hotplug: when using memmap_on_memory and onlining the memory
+	 *    to the MOVABLE zone, the vmemmap pages are also placed in such
+	 *    zone. Such pages cannot be really moved around as they are
+	 *    self-stored in the range, but they are treated as movable when
+	 *    the range that they describe is about to be offlined.
 	 *
 	 * In general, no unmovable allocations that degrade memory offlining
 	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a8cef4955907..9371e7d3f583 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
+static bool memmap_on_memory_enabled;
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -606,10 +608,23 @@ void generic_online_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
-static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
+			       unsigned long buddy_start_pfn)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn;
+	unsigned long pfn = buddy_start_pfn;
+	unsigned int order = MAX_ORDER - 1;
+
+	/*
+	 * When using memmap_on_memory, the range might be unaligned as the
+	 * first pfns are used for vmemmap pages. Align it in case we need to.
+	 */
+	if (pfn & ((1 << order) - 1)) {
+		while (pfn & ((1 << order) - 1))
+			order--;
+		(*online_page_callback)(pfn_to_page(pfn), order);
+		pfn += 1 << order;
+	}
 
 	/*
 	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
@@ -617,7 +632,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 	 * later). We account all pages as being online and belonging to this
 	 * zone ("present").
 	 */
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
+	for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
 		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
 
 	/* mark all involved sections as online */
@@ -777,9 +792,9 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 }
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       int online_type, int nid)
+		       unsigned long nr_vmemmap_pages, int online_type, int nid)
 {
-	unsigned long flags;
+	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
@@ -790,11 +805,18 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
+	if (nr_vmemmap_pages)
+		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
+				       MIGRATE_UNMOVABLE);
+	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
+			       MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
@@ -810,7 +832,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * onlining, such that undo_isolate_page_range() works correctly.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
@@ -823,7 +845,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		setup_zone_pageset(zone);
 	}
 
-	online_pages_range(pfn, nr_pages);
+	online_pages_range(pfn, nr_pages, buddy_start_pfn);
 	zone->present_pages += nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -836,7 +858,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
-	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+	undo_isolate_page_range(buddy_start_pfn,
+				buddy_start_pfn + buddy_nr_pages,
+				MIGRATE_MOVABLE);
 
 	/*
 	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1011,6 +1035,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+	return memmap_on_memory_enabled &&
+	       size == memory_block_size_bytes();
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1020,6 +1050,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+	struct vmem_altmap mhp_altmap = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
@@ -1046,13 +1077,32 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 	new_node = ret;
 
+	/*
+	 * Return -EINVAL if the caller specified MHP_MEMMAP_ON_MEMORY and we do
+	 * not support it.
+	 */
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    !mhp_supports_memmap_on_memory(size)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
+	/*
+	 * Self hosted memmap array
+	 */
+	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
+		mhp_altmap.free = size >> PAGE_SHIFT;
+		mhp_altmap.base_pfn = start >> PAGE_SHIFT;
+		params.altmap = &mhp_altmap;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &params);
 	if (ret < 0)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
 	if (ret) {
 		arch_remove_memory(nid, start, size, NULL);
 		goto error;
@@ -1448,10 +1498,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			unsigned long nr_vmemmap_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, system_ram_pages = 0;
+	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
@@ -1463,6 +1514,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/*
@@ -1498,7 +1552,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone_pcp_disable(zone);
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
+	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
@@ -1518,7 +1572,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	}
 
 	do {
-		pfn = start_pfn;
+		pfn = buddy_start_pfn;
 		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
@@ -1549,18 +1603,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		 * offlining actually in order to make hugetlbfs's object
 		 * counting consistent.
 		 */
-		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
 		if (ret) {
 			reason = "failure to dissolve huge pages";
 			goto failed_removal_isolated;
 		}
 
-		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
+		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
 
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn);
+	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
 	pr_info("Offlined Pages %ld\n", nr_pages);
 
 	/*
@@ -1569,13 +1623,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	 * of isolated pageblocks, memory onlining will properly revert this.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	zone_pcp_enable(zone);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
 	zone->present_pages -= nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1635,6 +1689,16 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return 0;
 }
 
+static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
+	int ret = !mem->nr_vmemmap_pages;
+
+	if (!ret)
+		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
+	return ret;
+}
+
 static int check_cpu_on_node(pg_data_t *pgdat)
 {
 	int cpu;
@@ -1709,6 +1773,9 @@ EXPORT_SYMBOL(try_offline_node);
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
 	int rc = 0;
+	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap *altmap = NULL;
+	unsigned long nr_vmemmap_pages = 0;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1721,6 +1788,37 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 
+	/*
+	 * When using MHP_MEMMAP_ON_MEMORY, add_memory and remove_memory needs
+	 * to work with the same granularity.
+	 * Check that we are not violating that assumption.
+	 * The only way we have right now is to walk all the memory
+	 * blocks we are trying to remove and see if any has vmemmap pages.
+	 * We should not have any as memmap_on_memory only works with ranges
+	 * that spans a single memory block, so if we found any and size spans
+	 * more than a memory block, we know we have to scream.
+	 * We only need to check this if memmap_on_memory_enabled=true.
+	 */
+	if (memmap_on_memory_enabled) {
+		(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
+					 get_memblock_vmemmap_pages_cb);
+		if (nr_vmemmap_pages) {
+			if (size != memory_block_size_bytes()) {
+				pr_warn("Refuse to remove %#llx - %#llx, wrong granularity\n"
+					 start, start + size);
+				return -EINVAL;
+			}
+
+			/*
+			 * Let remove_pmd_table->free_hugepage_table
+			 * do the right thing if we used vmem_altmap
+			 * when hot-adding the range.
+			 */
+			mhp_altmap.alloc = nr_vmemmap_pages;
+			altmap = &mhp_altmap;
+		}
+	}
+
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
@@ -1732,7 +1830,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, NULL);
+	arch_remove_memory(nid, start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 774542e1483e..02bca3b880a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8825,7 +8825,8 @@ void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone, must not contain holes,
  * must span full sections, and must be isolated before calling this function.
  */
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn,
+			      unsigned long buddy_start_pfn)
 {
 	unsigned long pfn = start_pfn;
 	struct page *page;
@@ -8836,6 +8837,7 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
+	pfn = buddy_start_pfn;
 	while (pfn < end_pfn) {
 		page = pfn_to_page(pfn);
 		/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
  2020-12-17 13:07 ` [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
  2020-12-17 13:07 ` [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
@ 2020-12-17 13:07 ` Oscar Salvador
  2021-01-11 16:53   ` David Hildenbrand
  2020-12-17 13:07 ` [PATCH 4/5] powerpc/memhotplug: " Oscar Salvador
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2020-12-17 13:07 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin,
	Oscar Salvador

Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/acpi/acpi_memhotplug.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..8cc195c4c861 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 	acpi_handle handle = mem_device->device->handle;
 	int result, num_enabled = 0;
 	struct acpi_memory_info *info;
+	mhp_t mhp_flags = MHP_NONE;
 	int node;
 
 	node = acpi_get_node(handle);
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
+		if (mhp_supports_memmap_on_memory(info->length))
+			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 		result = __add_memory(node, info->start_addr, info->length,
-				      MHP_NONE);
+				      mhp_flags);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] powerpc/memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (2 preceding siblings ...)
  2020-12-17 13:07 ` [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
@ 2020-12-17 13:07 ` Oscar Salvador
  2021-01-11 16:55   ` David Hildenbrand
  2020-12-17 13:07 ` [PATCH 5/5] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
  2021-01-11  9:05 ` [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
  5 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2020-12-17 13:07 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin,
	Oscar Salvador

Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 7efe6ec5d14a..a7f68e282ec1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -610,6 +610,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 
 static int dlpar_add_lmb(struct drmem_lmb *lmb)
 {
+	mhp_t mhp_flags = MHP_NONE;
 	unsigned long block_sz;
 	int nid, rc;
 
@@ -629,8 +630,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	if (nid < 0 || !node_possible(nid))
 		nid = first_online_node;
 
+	if (mhp_supports_memmap_on_memory(block_sz))
+		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 	/* Add the memory */
-	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
+	rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (3 preceding siblings ...)
  2020-12-17 13:07 ` [PATCH 4/5] powerpc/memhotplug: " Oscar Salvador
@ 2020-12-17 13:07 ` Oscar Salvador
  2021-01-11  9:05 ` [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
  5 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-12-17 13:07 UTC (permalink / raw)
  To: akpm
  Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin,
	Oscar Salvador

Self stored memmap leads to a sparse memory situation which is unsuitable
for workloads that requires large contiguous memory chunks, so make this
an opt-in which needs to be explicitly enabled.

To control this, let memory_hotplug have its own memory space, as suggested
by David, so we can add memory_hotplug.memmap_on_memory parameter.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 .../admin-guide/kernel-parameters.txt         | 14 ++++++++++++
 mm/Makefile                                   |  5 ++++-
 mm/memory_hotplug.c                           | 22 ++++++++++++++++++-
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c722ec19cd00..8ff3d7c87165 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2776,6 +2776,20 @@
 			seconds.  Use this parameter to check at some
 			other rate.  0 disables periodic checking.
 
+	memory_hotplug.memmap_on_memory
+			[KNL,X86,ARM,PPC] Boolean flag to enable this feature.
+			Format: {on | off (default)}
+			When enabled, memory to build the pages tables for the
+			memmap array describing the hot-added range will be taken
+			from the range itself, so the memmap page tables will be
+			self-hosted.
+			Since only single memory device ranges are supported at
+			the moment, this option is disabled by default because
+			it might have an impact on workloads that needs large
+			contiguous memory chunks.
+			The state of the flag can be read in
+			/sys/module/memory_hotplug/parameters/memmap_on_memory.
+
 	memtest=	[KNL,X86,ARM,PPC] Enable memtest
 			Format: <integer>
 			default : 0 <disable>
diff --git a/mm/Makefile b/mm/Makefile
index 6b581f8337e8..0425cec6cb17 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
 
+# Give 'memory_hotplug' its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+
 obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
@@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)	+= kasan/
 obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9371e7d3f583..f5f95f49b98a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,7 +42,27 @@
 #include "internal.h"
 #include "shuffle.h"
 
-static bool memmap_on_memory_enabled;
+/*
+ * memory_hotplug.memmap_on_memory parameter
+ */
+static bool memmap_on_memory_enabled __ro_after_init;
+
+static int memmap_on_memory_show(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%s\n",
+		       memmap_on_memory_enabled ? "on" : "off");
+}
+
+static __meminit int memmap_on_memory_store(const char *val,
+					    const struct kernel_param *kp)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
+		return -EINVAL;
+
+	return param_set_bool(val, kp);
+}
+module_param_call(memmap_on_memory, memmap_on_memory_store,
+		  memmap_on_memory_show, &memmap_on_memory_enabled, 0400);
 
 /*
  * online_page_callback contains pointer to current page onlining function.
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] Allocate memmap from hotadded memory (per device)
  2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (4 preceding siblings ...)
  2020-12-17 13:07 ` [PATCH 5/5] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
@ 2021-01-11  9:05 ` Oscar Salvador
  5 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2021-01-11  9:05 UTC (permalink / raw)
  To: akpm; +Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Thu, Dec 17, 2020 at 02:07:53PM +0100, Oscar Salvador wrote:
> I figured I would send a new version before going on vacation, so I can
> work on it when I am back.
> 
> Changes from RFCv3 to Patchv1:
>  - Addressed feedback from David
>  - Re-order patches
> 
> Changes from v2 -> v3:
>  - Re-order patches (Michal)
>  - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
>  - Add kernel boot option to enable this feature (Michal)
> 
> Changes from v1 -> v2:
>  - Addressed feedback provided by David
>  - Add a arch_support_memmap_on_memory to be called
>    from mhp_supports_memmap_on_memory, as atm,
>    only ARM, powerpc and x86_64 have altmat support.

Now that Christmas holidays are gone, let us dig this up.

thanks

 
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2020-12-17 13:07 ` [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
@ 2021-01-11 16:52   ` David Hildenbrand
  2021-01-12  7:26     ` Oscar Salvador
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2021-01-11 16:52 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 17.12.20 14:07, Oscar Salvador wrote:
> In order to use self-hosted memmap array, the platform needs to have
> support for CONFIG_SPARSEMEM_VMEMMAP and altmap.
> Currently, only arm64, PPC and x86_64 have the support, so enable those
> platforms with ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.

"In the first version, only .... will support it".

I'd probably split off enabling it per architecture in separate patches
and the end of the series.

Apart from that looks good.

> 
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE will be checked later on to see whether
> we can enable this feature or not.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-17 13:07 ` [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
@ 2021-01-11 16:53   ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2021-01-11 16:53 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 17.12.20 14:07, Oscar Salvador wrote:
> Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
> checking mhp_supports_memmap_on_memory().
> MHP_MEMMAP_ON_MEMORY can only be set in case
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
> altmap, and the range to be added spans a single memory block.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  drivers/acpi/acpi_memhotplug.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index b02fd51e5589..8cc195c4c861 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  	acpi_handle handle = mem_device->device->handle;
>  	int result, num_enabled = 0;
>  	struct acpi_memory_info *info;
> +	mhp_t mhp_flags = MHP_NONE;
>  	int node;
>  
>  	node = acpi_get_node(handle);
> @@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  		if (node < 0)
>  			node = memory_add_physaddr_to_nid(info->start_addr);
>  
> +		if (mhp_supports_memmap_on_memory(info->length))
> +			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>  		result = __add_memory(node, info->start_addr, info->length,
> -				      MHP_NONE);
> +				      mhp_flags);
>  
>  		/*
>  		 * If the memory block has been used by the kernel, add_memory()
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] powerpc/memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-17 13:07 ` [PATCH 4/5] powerpc/memhotplug: " Oscar Salvador
@ 2021-01-11 16:55   ` David Hildenbrand
  2021-01-12 13:31     ` Oscar Salvador
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2021-01-11 16:55 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 17.12.20 14:07, Oscar Salvador wrote:
> Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
> checking mhp_supports_memmap_on_memory().
> MHP_MEMMAP_ON_MEMORY can only be set in case
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
> altmap, and the range to be added spans a single memory block.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 7efe6ec5d14a..a7f68e282ec1 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -610,6 +610,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
>  
>  static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  {
> +	mhp_t mhp_flags = MHP_NONE;
>  	unsigned long block_sz;
>  	int nid, rc;
>  
> @@ -629,8 +630,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  	if (nid < 0 || !node_possible(nid))
>  		nid = first_online_node;
>  
> +	if (mhp_supports_memmap_on_memory(block_sz))
> +		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>  	/* Add the memory */
> -	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
> +	rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
>  	if (rc) {
>  		invalidate_lmb_associativity_index(lmb);
>  		return rc;

With 16MB LMBs it's quite wasteful - you won't even have a huge page
fitting the the remaining part.

I do wonder if we want this on powerpc only with a bigger LMB/memory
block size (e.g., 256 MB, which is AFAIK the maximum usually found).

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-01-11 16:52   ` David Hildenbrand
@ 2021-01-12  7:26     ` Oscar Salvador
  2021-01-12 10:12       ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2021-01-12  7:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Mon, Jan 11, 2021 at 05:52:19PM +0100, David Hildenbrand wrote:
> On 17.12.20 14:07, Oscar Salvador wrote:
> > In order to use self-hosted memmap array, the platform needs to have
> > support for CONFIG_SPARSEMEM_VMEMMAP and altmap.
> > Currently, only arm64, PPC and x86_64 have the support, so enable those
> > platforms with ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.
> 
> "In the first version, only .... will support it".

I will try to be more specific.

> 
> I'd probably split off enabling it per architecture in separate patches
> and the end of the series.

You mean introducing only mm/Kconfig change in this patch, and then
arch/*/Kconfig changes in separate patches at the end of the series?

I can certainly do that, not sure how much will help with the review,
but it might help when bisecting.

> Apart from that looks good.

Thanks for taking a look David!


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-01-12  7:26     ` Oscar Salvador
@ 2021-01-12 10:12       ` David Hildenbrand
  2021-01-12 11:17         ` Oscar Salvador
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2021-01-12 10:12 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 12.01.21 08:26, Oscar Salvador wrote:
> On Mon, Jan 11, 2021 at 05:52:19PM +0100, David Hildenbrand wrote:
>> On 17.12.20 14:07, Oscar Salvador wrote:
>>> In order to use self-hosted memmap array, the platform needs to have
>>> support for CONFIG_SPARSEMEM_VMEMMAP and altmap.
>>> Currently, only arm64, PPC and x86_64 have the support, so enable those
>>> platforms with ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.
>>
>> "In the first version, only .... will support it".
> 
> I will try to be more specific.
> 
>>
>> I'd probably split off enabling it per architecture in separate patches
>> and the end of the series.
> 
> You mean introducing only mm/Kconfig change in this patch, and then
> arch/*/Kconfig changes in separate patches at the end of the series?

Yeah, or squashing the leftovers of this patch (3 LOC) into patch #2.

> 
> I can certainly do that, not sure how much will help with the review,
> but it might help when bisecting.

It's usually nicer to explicitly enable stuff per architecture, stating
why it works on that architecture (and in the best case, even was
tested!). :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-01-12 10:12       ` David Hildenbrand
@ 2021-01-12 11:17         ` Oscar Salvador
  2021-01-12 11:17           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2021-01-12 11:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Tue, Jan 12, 2021 at 11:12:30AM +0100, David Hildenbrand wrote:
> On 12.01.21 08:26, Oscar Salvador wrote:
> > You mean introducing only mm/Kconfig change in this patch, and then
> > arch/*/Kconfig changes in separate patches at the end of the series?
> 
> Yeah, or squashing the leftovers of this patch (3 LOC) into patch #2.

Ok, makes sense.

> > I can certainly do that, not sure how much will help with the review,
> > but it might help when bisecting.
> 
> It's usually nicer to explicitly enable stuff per architecture, stating
> why it works on that architecture (and in the best case, even was
> tested!). :)

Fine by me.
I will prepare another re-spin with that in mind then.

It would be great to have some feedback on patch#2 before that (in case you find
some time ;-).

Thanks!

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-01-12 11:17         ` Oscar Salvador
@ 2021-01-12 11:17           ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2021-01-12 11:17 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 12.01.21 12:17, Oscar Salvador wrote:
> On Tue, Jan 12, 2021 at 11:12:30AM +0100, David Hildenbrand wrote:
>> On 12.01.21 08:26, Oscar Salvador wrote:
>>> You mean introducing only mm/Kconfig change in this patch, and then
>>> arch/*/Kconfig changes in separate patches at the end of the series?
>>
>> Yeah, or squashing the leftovers of this patch (3 LOC) into patch #2.
> 
> Ok, makes sense.
> 
>>> I can certainly do that, not sure how much will help with the review,
>>> but it might help when bisecting.
>>
>> It's usually nicer to explicitly enable stuff per architecture, stating
>> why it works on that architecture (and in the best case, even was
>> tested!). :)
> 
> Fine by me.
> I will prepare another re-spin with that in mind then.
> 
> It would be great to have some feedback on patch#2 before that (in case you find
> some time ;-).

Yeah, will get to that soon. Mailbox is flooded right now :D


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] powerpc/memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2021-01-11 16:55   ` David Hildenbrand
@ 2021-01-12 13:31     ` Oscar Salvador
  0 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2021-01-12 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Mon, Jan 11, 2021 at 05:55:37PM +0100, David Hildenbrand wrote:
> On 17.12.20 14:07, Oscar Salvador wrote:
 
> With 16MB LMBs it's quite wasteful - you won't even have a huge page
> fitting the the remaining part.
> 
> I do wonder if we want this on powerpc only with a bigger LMB/memory
> block size (e.g., 256 MB, which is AFAIK the maximum usually found).

Yeah, powerpc is trickier than the others.
It is more difficult to take advante of this in this platfowm as depending
on the page's size, a larger memory are might have been mapped before,
and so, further hot-add operations will be a noop wrt. populating sections,
until the range to be added falls in non-mapped area.

Nevertheless, I will make this wired to larger memory blocks as you said.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-17 13:07 ` [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
@ 2021-01-18 10:11   ` Oscar Salvador
  2021-01-19 13:58   ` David Hildenbrand
  1 sibling, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2021-01-18 10:11 UTC (permalink / raw)
  To: akpm; +Cc: david, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Thu, Dec 17, 2020 at 02:07:55PM +0100, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
>     (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
>     which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>     populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Let us refloat this one before it sinks deeper :-)


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-17 13:07 ` [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
  2021-01-18 10:11   ` Oscar Salvador
@ 2021-01-19 13:58   ` David Hildenbrand
  2021-01-25 10:39     ` Oscar Salvador
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2021-01-19 13:58 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 17.12.20 14:07, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
>     (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
>     which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>     populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

IIRC, there is a conflict with the hpage vmemmap freeing patch set,
right? How are we going to handle that?

[...]

> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 439a89e758d8..f09d60ef8ad9 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -30,6 +30,11 @@ struct memory_block {
>  	int phys_device;		/* to which fru does this belong? */
>  	struct device dev;
>  	int nid;			/* NID for this memory block */
> +	unsigned long nr_vmemmap_pages;	/*
> +					 * Number of vmemmap pages. These pages
> +					 * lay at the beginning of the memory
> +					 * block.
> +					 */

I suggest moving this comment above nr_vmemmap_pages.

>  };
>  
>  int arch_get_memory_phys_device(unsigned long start_pfn);
> @@ -81,7 +86,8 @@ static inline int memory_notify(unsigned long val, void *v)
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> -int create_memory_block_devices(unsigned long start, unsigned long size);
> +int create_memory_block_devices(unsigned long start, unsigned long size,
> +				unsigned long vmemmap_pages);
>  void remove_memory_block_devices(unsigned long start, unsigned long size);
>  extern void memory_dev_init(void);
>  extern int memory_notify(unsigned long val, void *v);
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 15acce5ab106..67ea4a0b25bd 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -70,6 +70,14 @@ typedef int __bitwise mhp_t;
>   */
>  #define MEMHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
>  
> +/*
> + * We want memmap (struct page array) to be self contained.
> + * To do so, we will use the beginning of the hot-added range to build
> + * the page tables for the memmap array that describes the entire range.
> + * Only selected architectures support it with SPARSE_VMEMMAP.
> + */
> +#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
> +
>  /*
>   * Extended parameters for memory hotplug:
>   * altmap: alternative allocator for memmap array (optional)
> @@ -113,11 +121,13 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
>  extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
>  /* VM interface that may be used by firmware interface */
>  extern int online_pages(unsigned long pfn, unsigned long nr_pages,
> -			int online_type, int nid);
> +			unsigned long nr_vmemmap_pages, int online_type,
> +			int nid);
>  extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
>  					 unsigned long end_pfn);
>  extern void __offline_isolated_pages(unsigned long start_pfn,
> -				     unsigned long end_pfn);
> +				     unsigned long end_pfn,
> +				     unsigned long buddy_start_pfn);
>  
>  typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
>  
> @@ -312,7 +322,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  
>  extern void try_offline_node(int nid);
> -extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> +extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			 unsigned long nr_vmemmap_pages);
>  extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
>  extern int offline_and_remove_memory(int nid, u64 start, u64 size);
> @@ -320,7 +331,8 @@ extern int offline_and_remove_memory(int nid, u64 start, u64 size);
>  #else
>  static inline void try_offline_node(int nid) {}
>  
> -static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +				unsigned long nr_vmemmap_pages)
>  {
>  	return -EINVAL;
>  }
> @@ -364,6 +376,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_
>  extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>  				      struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
> +extern bool mhp_supports_memmap_on_memory(unsigned long size);
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 86c6c368ce9b..3de5d482ac1a 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -17,7 +17,7 @@ struct device;
>   * @alloc: track pages consumed, private to vmemmap_populate()
>   */
>  struct vmem_altmap {
> -	const unsigned long base_pfn;
> +	unsigned long base_pfn;
>  	const unsigned long end_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b593316bff3d..d16b4411e422 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -407,6 +407,11 @@ enum zone_type {
>  	 *    techniques might use alloc_contig_range() to hide previously
>  	 *    exposed pages from the buddy again (e.g., to implement some sort
>  	 *    of memory unplug in virtio-mem).
> +	 * 6. Memory-hotplug: when using memmap_on_memory and onlining the memory
> +	 *    to the MOVABLE zone, the vmemmap pages are also placed in such
> +	 *    zone. Such pages cannot be really moved around as they are
> +	 *    self-stored in the range, but they are treated as movable when
> +	 *    the range that they describe is about to be offlined.
>  	 *
>  	 * In general, no unmovable allocations that degrade memory offlining
>  	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a8cef4955907..9371e7d3f583 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +static bool memmap_on_memory_enabled;
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -606,10 +608,23 @@ void generic_online_page(struct page *page, unsigned int order)
>  }
>  EXPORT_SYMBOL_GPL(generic_online_page);
>  
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long buddy_start_pfn)
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn;
> +	unsigned long pfn = buddy_start_pfn;
> +	unsigned int order = MAX_ORDER - 1;
> +
> +	/*
> +	 * When using memmap_on_memory, the range might be unaligned as the
> +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> +	 */
> +	if (pfn & ((1 << order) - 1)) {
> +		while (pfn & ((1 << order) - 1))
> +			order--;
> +		(*online_page_callback)(pfn_to_page(pfn), order);
> +		pfn += 1 << order;
> +	}
>  

Most probably we can simplify once we assure buddy_start_pfn follows
specific requirements (e.g., at least aligned to pageblock_nr_pages, see
below).

>  	/*
>  	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
> @@ -617,7 +632,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
>  	 * later). We account all pages as being online and belonging to this
>  	 * zone ("present").
>  	 */
> -	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> +	for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
>  		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
>  
>  	/* mark all involved sections as online */
> @@ -777,9 +792,9 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
>  }
>  
>  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       int online_type, int nid)
> +		       unsigned long nr_vmemmap_pages, int online_type, int nid)
>  {
> -	unsigned long flags;
> +	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
>  	struct zone *zone;
>  	int need_zonelists_rebuild = 0;
>  	int ret;
> @@ -790,11 +805,18 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
>  		return -EINVAL;
>  
> +	buddy_start_pfn = pfn + nr_vmemmap_pages;
> +	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
>  	mem_hotplug_begin();
>  
>  	/* associate pfn range with the zone */
>  	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> -	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
> +	if (nr_vmemmap_pages)
> +		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
> +				       MIGRATE_UNMOVABLE);

One sign that we have to take care that nr_vmemmap_pages has a certain
minimum alignment.

> +	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
> +			       MIGRATE_ISOLATE);
>  
>  	arg.start_pfn = pfn;
>  	arg.nr_pages = nr_pages;
> @@ -810,7 +832,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	 * onlining, such that undo_isolate_page_range() works correctly.
>  	 */
>  	spin_lock_irqsave(&zone->lock, flags);
> -	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
> +	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;

And, another sign :)

>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	/*
> @@ -823,7 +845,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  		setup_zone_pageset(zone);
>  	}
>  
> -	online_pages_range(pfn, nr_pages);
> +	online_pages_range(pfn, nr_pages, buddy_start_pfn);
>  	zone->present_pages += nr_pages;
>  
>  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -836,7 +858,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	zone_pcp_update(zone);
>  
>  	/* Basic onlining is complete, allow allocation of onlined pages. */
> -	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
> +	undo_isolate_page_range(buddy_start_pfn,
> +				buddy_start_pfn + buddy_nr_pages,
> +				MIGRATE_MOVABLE);

Yet another sign :)

>  
>  	/*
>  	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1011,6 +1035,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  	return device_online(&mem->dev);
>  }
>  
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	return memmap_on_memory_enabled &&
> +	       size == memory_block_size_bytes();
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1020,6 +1050,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
>  	struct mhp_params params = { .pgprot = PAGE_KERNEL };
> +	struct vmem_altmap mhp_altmap = {};
>  	u64 start, size;
>  	bool new_node = false;
>  	int ret;
> @@ -1046,13 +1077,32 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  		goto error;
>  	new_node = ret;
>  
> +	/*
> +	 * Return -EINVAL if the caller specified MHP_MEMMAP_ON_MEMORY and we do
> +	 * not support it.
> +	 */

The "Return -EINVAL" part is obvious. Maybe drop that comment.

> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> +	    !mhp_supports_memmap_on_memory(size)) {
> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
> +	/*
> +	 * Self hosted memmap array
> +	 */
> +	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> +		mhp_altmap.free = size >> PAGE_SHIFT;
> +		mhp_altmap.base_pfn = start >> PAGE_SHIFT;

Could use PFN_PHYS() for both.

> +		params.altmap = &mhp_altmap;
> +	}
> +

I'd combine both parts

if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
	if (!mhp_supports_memmap_on_memory(size)) {
		ret = -EINVAL;
		goto error;
	}
	mhp_altmap.free = PFN_PHYS(size);
	mhp_altmap.base_pfn = PFN_PHYS(size);
	params.altmap = &mhp_altmap;

}

>  	/* call arch's memory hotadd */
>  	ret = arch_add_memory(nid, start, size, &params);
>  	if (ret < 0)
>  		goto error;
>  
>  	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size);
> +	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);

Interresting, so we automatically support differeing sizeof(struct
page). I guess it will be problematic in case of sizeof(struct page) !=
64, because then, we might not have multiples of 2MB for the memmap of a
memory block.

IIRC, it could happen that if we add two consecutive memory blocks, that
the second one might reuse parts of the vmemmap residing on the first
memory block. If you remove the first one, you might be in trouble.

E.g., on x86-64
 vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
- Populate a huge page

vmemmap_free()->remove_pagetable()...->remove_pmd_table():
- memchr_inv() will leave the hugepage populated.

Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
Or do we somehow want to fix that? We should never populate partial huge
pages from an altmap ...

But maybe I am missing something.

>  	if (ret) {
>  		arch_remove_memory(nid, start, size, NULL);
>  		goto error;
> @@ -1448,10 +1498,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
>  	return 0;
>  }
>  
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			unsigned long nr_vmemmap_pages)
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn, system_ram_pages = 0;
> +	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
>  	unsigned long flags;
>  	struct zone *zone;
>  	struct memory_notify arg;
> @@ -1463,6 +1514,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
>  		return -EINVAL;
>  
> +	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> +	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
>  	mem_hotplug_begin();
>  
>  	/*
> @@ -1498,7 +1552,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	zone_pcp_disable(zone);
>  
>  	/* set above range as isolated */
> -	ret = start_isolate_page_range(start_pfn, end_pfn,
> +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);

If buddy_start_pfn is not aligned to pageblocks,
start_isolate_page_range will BUG_ON(). another sign that we need to
check that somewhere while adding that nr_vmemmap_pages is suitable.


[...]

>  
> +static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void *arg)

Let's avoid using the "memblock" terminology. Just call it
"get_nr_vmemmap_pages_cb" or maybe "detect_memmap_on_memory_cb"

> +{
> +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +	int ret = !mem->nr_vmemmap_pages;
> +
> +	if (!ret)
> +		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> +	return ret;

I think you can avoid "ret".

Also, we want to stop whenever we find "mem->nr_vmemmap_pages != 0",
right? AFAIU, you would currently stop whenever you find
mem->nr_vmemmap_pages == 0", which is wrong.

*nr_vmemmap_pages += mem->nr_vmemmap_pages;
/* Stop whenever we find any sign of MHP_MEMMAP_ON_MEMORY. */
return mem->nr_vmemmap_pages;

> +}
> +
>  static int check_cpu_on_node(pg_data_t *pgdat)
>  {
>  	int cpu;
> @@ -1709,6 +1773,9 @@ EXPORT_SYMBOL(try_offline_node);
>  static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  {
>  	int rc = 0;
> +	struct vmem_altmap mhp_altmap = {};
> +	struct vmem_altmap *altmap = NULL;
> +	unsigned long nr_vmemmap_pages = 0;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -1721,6 +1788,37 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  	if (rc)
>  		return rc;
>  
> +	/*
> +	 * When using MHP_MEMMAP_ON_MEMORY, add_memory and remove_memory needs
> +	 * to work with the same granularity.
> +	 * Check that we are not violating that assumption.
> +	 * The only way we have right now is to walk all the memory
> +	 * blocks we are trying to remove and see if any has vmemmap pages.
> +	 * We should not have any as memmap_on_memory only works with ranges
> +	 * that spans a single memory block, so if we found any and size spans
> +	 * more than a memory block, we know we have to scream.
> +	 * We only need to check this if memmap_on_memory_enabled=true.
> +	 */

Can we simplify that comment?

/*
 * We only support remvoing memory added with MHP_MEMMAP_ON_MEMORY in
 * the same granularity it was added - a single memory block.
 */

> +	if (memmap_on_memory_enabled) {
> +		(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
> +					 get_memblock_vmemmap_pages_cb);

you can drop the "(void)", just like other users that ignore the return
value.

> +		if (nr_vmemmap_pages) {
> +			if (size != memory_block_size_bytes()) {
> +				pr_warn("Refuse to remove %#llx - %#llx, wrong granularity\n"
> +					 start, start + size);
> +				return -EINVAL;
> +			}
> +
> +			/*
> +			 * Let remove_pmd_table->free_hugepage_table
> +			 * do the right thing if we used vmem_altmap
> +			 * when hot-adding the range.
> +			 */
> +			mhp_altmap.alloc = nr_vmemmap_pages;
> +			altmap = &mhp_altmap;
> +		}
> +	}
> +



-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-19 13:58   ` David Hildenbrand
@ 2021-01-25 10:39     ` Oscar Salvador
  2021-01-25 10:56       ` Oscar Salvador
  2021-01-25 10:57       ` David Hildenbrand
  0 siblings, 2 replies; 24+ messages in thread
From: Oscar Salvador @ 2021-01-25 10:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Tue, Jan 19, 2021 at 02:58:41PM +0100, David Hildenbrand wrote:
> IIRC, there is a conflict with the hpage vmemmap freeing patch set,
> right? How are we going to handle that?

First of all, sorry for the lateness David, was a bit busy.

AFAIK, there is no conflict at the moment.
hpage vmemmap populates with basepages (PMD support is disabled).

But even with PMD support enabled, I cannot see how it can interfere.

From the mhp point of view vmemmap pages only matter when allocating it and
when freeing it.
The only moment we would have to worry about is when we have to get rid of them
(aka: the memory range that describe is being hot-removed), but the hpage
vmemmap patchset makes sure that by the time we return from dissolve_free_huge_page
the vemmap pages look as they did before the remapping.

But as I said, we do not have to worry about this one as that feature only works
when populating with base pages.

> > diff --git a/include/linux/memory.h b/include/linux/memory.h
> > index 439a89e758d8..f09d60ef8ad9 100644
> > --- a/include/linux/memory.h
> > +++ b/include/linux/memory.h
> > @@ -30,6 +30,11 @@ struct memory_block {
> >  	int phys_device;		/* to which fru does this belong? */
> >  	struct device dev;
> >  	int nid;			/* NID for this memory block */
> > +	unsigned long nr_vmemmap_pages;	/*
> > +					 * Number of vmemmap pages. These pages
> > +					 * lay at the beginning of the memory
> > +					 * block.
> > +					 */
> 
> I suggest moving this comment above nr_vmemmap_pages.

Sure, will do.

> > -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> > +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> > +			       unsigned long buddy_start_pfn)
> >  {
> >  	const unsigned long end_pfn = start_pfn + nr_pages;
> > -	unsigned long pfn;
> > +	unsigned long pfn = buddy_start_pfn;
> > +	unsigned int order = MAX_ORDER - 1;
> > +
> > +	/*
> > +	 * When using memmap_on_memory, the range might be unaligned as the
> > +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> > +	 */
> > +	if (pfn & ((1 << order) - 1)) {
> > +		while (pfn & ((1 << order) - 1))
> > +			order--;
> > +		(*online_page_callback)(pfn_to_page(pfn), order);
> > +		pfn += 1 << order;
> > +	}
> >  
> 
> Most probably we can simplify once we assure buddy_start_pfn follows
> specific requirements (e.g., at least aligned to pageblock_nr_pages, see
> below).

Yeah, on x86_64 this is pageblock_nr_pages aligned, but will have to check
on arm64.

> > +	buddy_start_pfn = pfn + nr_vmemmap_pages;
> > +	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> > +
> >  	mem_hotplug_begin();
> >  
> >  	/* associate pfn range with the zone */
> >  	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> > -	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
> > +	if (nr_vmemmap_pages)
> > +		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
> > +				       MIGRATE_UNMOVABLE);
> 
> One sign that we have to take care that nr_vmemmap_pages has a certain
> minimum alignment.

Right.

> > +	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
> > +			       MIGRATE_ISOLATE);
> >  
> >  	arg.start_pfn = pfn;
> >  	arg.nr_pages = nr_pages;
> > @@ -810,7 +832,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> >  	 * onlining, such that undo_isolate_page_range() works correctly.
> >  	 */
> >  	spin_lock_irqsave(&zone->lock, flags);
> > -	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
> > +	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
> 
> And, another sign :)

Yap

> > -	online_pages_range(pfn, nr_pages);
> > +	online_pages_range(pfn, nr_pages, buddy_start_pfn);
> >  	zone->present_pages += nr_pages;
> >  
> >  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> > @@ -836,7 +858,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> >  	zone_pcp_update(zone);
> >  
> >  	/* Basic onlining is complete, allow allocation of onlined pages. */
> > -	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
> > +	undo_isolate_page_range(buddy_start_pfn,
> > +				buddy_start_pfn + buddy_nr_pages,
> > +				MIGRATE_MOVABLE);
> 
> Yet another sign :)

Yap


> >  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> >  {
> >  	struct mhp_params params = { .pgprot = PAGE_KERNEL };
> > +	struct vmem_altmap mhp_altmap = {};
> >  	u64 start, size;
> >  	bool new_node = false;
> >  	int ret;
> > @@ -1046,13 +1077,32 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> >  		goto error;
> >  	new_node = ret;
> >  
> > +	/*
> > +	 * Return -EINVAL if the caller specified MHP_MEMMAP_ON_MEMORY and we do
> > +	 * not support it.
> > +	 */
> 
> The "Return -EINVAL" part is obvious. Maybe drop that comment.

Sure.

> > +	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > +		mhp_altmap.free = size >> PAGE_SHIFT;
> > +		mhp_altmap.base_pfn = start >> PAGE_SHIFT;
> 
> Could use PFN_PHYS() for both.

Will do.

> I'd combine both parts
> 
> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> 	if (!mhp_supports_memmap_on_memory(size)) {
> 		ret = -EINVAL;
> 		goto error;
> 	}
> 	mhp_altmap.free = PFN_PHYS(size);
> 	mhp_altmap.base_pfn = PFN_PHYS(size);
> 	params.altmap = &mhp_altmap;
> 
> }

Ok, cleaner.

> Interresting, so we automatically support differeing sizeof(struct
> page). I guess it will be problematic in case of sizeof(struct page) !=
> 64, because then, we might not have multiples of 2MB for the memmap of a
> memory block.
> 
> IIRC, it could happen that if we add two consecutive memory blocks, that
> the second one might reuse parts of the vmemmap residing on the first
> memory block. If you remove the first one, you might be in trouble.
> 
> E.g., on x86-64
>  vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
> - Populate a huge page
> 
> vmemmap_free()->remove_pagetable()...->remove_pmd_table():
> - memchr_inv() will leave the hugepage populated.
> 
> Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
> Or do we somehow want to fix that? We should never populate partial huge
> pages from an altmap ...
> 
> But maybe I am missing something.

No, you are not missing anything.

I think that remove_pmd_table() should be fixed.
vmemmap_populate_hugepages() allocates PMD_SIZE chunk and marks the PMD as
PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
is large and 2) the chunk is not aligned, the memset() should be writing
PAGE_INUSE for a PMD chunk.

I do not really a see a reason why this should not happen.
Actually, we will be leaving pagetables behind as we never get to free
the PMD chunk when sizeof(struct page) is not a multiple of 2MB.

I plan to fix remove_pmd_table(), but for now let us not allow to use this feature
if the size of a struct page is not 64.
Let us keep it simply for the time being, shall we?

> >  	/* set above range as isolated */
> > -	ret = start_isolate_page_range(start_pfn, end_pfn,
> > +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
> >  				       MIGRATE_MOVABLE,
> >  				       MEMORY_OFFLINE | REPORT_FAILURE);
> 
> If buddy_start_pfn is not aligned to pageblocks,
> start_isolate_page_range will BUG_ON(). another sign that we need to
> check that somewhere while adding that nr_vmemmap_pages is suitable.

Will come up with a way to 1) ensure that vmemmap pages is aligned and
2) disallow to use the feature in case it cannot be aligned for some reason.

 
> Let's avoid using the "memblock" terminology. Just call it
> "get_nr_vmemmap_pages_cb" or maybe "detect_memmap_on_memory_cb"

Fine.

> > +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> > +	int ret = !mem->nr_vmemmap_pages;
> > +
> > +	if (!ret)
> > +		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> > +	return ret;
> 
> I think you can avoid "ret".
> 
> Also, we want to stop whenever we find "mem->nr_vmemmap_pages != 0",
> right? AFAIU, you would currently stop whenever you find
> mem->nr_vmemmap_pages == 0", which is wrong.
> 
> *nr_vmemmap_pages += mem->nr_vmemmap_pages;
> /* Stop whenever we find any sign of MHP_MEMMAP_ON_MEMORY. */
> return mem->nr_vmemmap_pages;

Yeah, true is that current logic is wrong, but we do not want to stop when
mem->nr_vmemmap_pages == 0 either.
Since we want to avoid people to remove with a different granularity than it was
added, we need to check the whole memory range that is going to be hot-removed
and check for mem->nr_vmemmap_pages to make sure that the range does not
have any self-hosted vmemmap pages.

E.g:

[memory_block not using vmemmap] memblock#0
[memory block using vmemmap]	 memblock#1

So, if someone tries to remove memblock#0 and memblock#1 and once, we need
to check whether any of the memblocks used self-hosted vmemmap, and if so,
scream out loud if we are trying to remove site != memblock_size.

> > +	/*
> > +	 * When using MHP_MEMMAP_ON_MEMORY, add_memory and remove_memory needs
> > +	 * to work with the same granularity.
> > +	 * Check that we are not violating that assumption.
> > +	 * The only way we have right now is to walk all the memory
> > +	 * blocks we are trying to remove and see if any has vmemmap pages.
> > +	 * We should not have any as memmap_on_memory only works with ranges
> > +	 * that spans a single memory block, so if we found any and size spans
> > +	 * more than a memory block, we know we have to scream.
> > +	 * We only need to check this if memmap_on_memory_enabled=true.
> > +	 */
> 
> Can we simplify that comment?
> 
> /*
>  * We only support remvoing memory added with MHP_MEMMAP_ON_MEMORY in
>  * the same granularity it was added - a single memory block.
>  */

Yeah.

> 
> > +	if (memmap_on_memory_enabled) {
> > +		(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
> > +					 get_memblock_vmemmap_pages_cb);
> 
> you can drop the "(void)", just like other users that ignore the return
> value.

ok

Thanks for the throughout review!
 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-25 10:39     ` Oscar Salvador
@ 2021-01-25 10:56       ` Oscar Salvador
  2021-01-25 11:02         ` David Hildenbrand
  2021-01-25 10:57       ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2021-01-25 10:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Mon, Jan 25, 2021 at 11:39:55AM +0100, Oscar Salvador wrote:
> > Interresting, so we automatically support differeing sizeof(struct
> > page). I guess it will be problematic in case of sizeof(struct page) !=
> > 64, because then, we might not have multiples of 2MB for the memmap of a
> > memory block.
> > 
> > IIRC, it could happen that if we add two consecutive memory blocks, that
> > the second one might reuse parts of the vmemmap residing on the first
> > memory block. If you remove the first one, you might be in trouble.
> > 
> > E.g., on x86-64
> >  vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
> > - Populate a huge page
> > 
> > vmemmap_free()->remove_pagetable()...->remove_pmd_table():
> > - memchr_inv() will leave the hugepage populated.
> > 
> > Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
> > Or do we somehow want to fix that? We should never populate partial huge
> > pages from an altmap ...
> > 
> > But maybe I am missing something.
> 
> No, you are not missing anything.
> 
> I think that remove_pmd_table() should be fixed.
> vmemmap_populate_hugepages() allocates PMD_SIZE chunk and marks the PMD as
> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
> is large and 2) the chunk is not aligned, the memset() should be writing
> PAGE_INUSE for a PMD chunk.
> 
> I do not really a see a reason why this should not happen.
> Actually, we will be leaving pagetables behind as we never get to free
> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
> 
> I plan to fix remove_pmd_table(), but for now let us not allow to use this feature
> if the size of a struct page is not 64.
> Let us keep it simply for the time being, shall we?

My first intention was:

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..0c9756a2eb24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1044,32 +1044,14 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 			continue;
 
 		if (pmd_large(*pmd)) {
-			if (IS_ALIGNED(addr, PMD_SIZE) &&
-			    IS_ALIGNED(next, PMD_SIZE)) {
-				if (!direct)
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-
-				spin_lock(&init_mm.page_table_lock);
-				pmd_clear(pmd);
-				spin_unlock(&init_mm.page_table_lock);
-				pages++;
-			} else {
-				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
-
-				page_addr = page_address(pmd_page(*pmd));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
-						PMD_SIZE)) {
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-
-					spin_lock(&init_mm.page_table_lock);
-					pmd_clear(pmd);
-					spin_unlock(&init_mm.page_table_lock);
-				}
-			}
+			if (!direct)
+				free_hugepage_table(pmd_page(*pmd),
+						    altmap);
 
+			spin_lock(&init_mm.page_table_lock);
+			pmd_clear(pmd);
+			spin_unlock(&init_mm.page_table_lock);
+			pages++;
 			continue;
 		}

I did not try it out yet and this might explode badly, but AFAICS, a PMD size
chunk is always allocated even when the section does not spand 2MB.
E.g: sizeof(struct page) = 56.

PAGES_PER_SECTION * 64 = 2097152
PAGES_PER_SECTION * 56 = 1835008

Even in the latter case, vmemmap_populate_hugepages will allocate a PMD size chunk
to satisfy the unaligned range.
So, unless I am missing something, it should not be a problem to free that 2MB in
remove_pmd_table when 1) the PMD is large and 2) the range is not aligned.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-25 10:39     ` Oscar Salvador
  2021-01-25 10:56       ` Oscar Salvador
@ 2021-01-25 10:57       ` David Hildenbrand
  2021-01-25 11:18         ` Oscar Salvador
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2021-01-25 10:57 UTC (permalink / raw)
  To: Oscar Salvador, Anshuman Khandual
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 25.01.21 11:39, Oscar Salvador wrote:
> On Tue, Jan 19, 2021 at 02:58:41PM +0100, David Hildenbrand wrote:
>> IIRC, there is a conflict with the hpage vmemmap freeing patch set,
>> right? How are we going to handle that?
> 
> First of all, sorry for the lateness David, was a bit busy.
> 
> AFAIK, there is no conflict at the moment.
> hpage vmemmap populates with basepages (PMD support is disabled).

I don't see how that makes a difference? We end up consuming the altmap
via vmemmap_alloc_block_buf(). Read below, most probably I am missing
something.

> 
> But even with PMD support enabled, I cannot see how it can interfere.
> 
> From the mhp point of view vmemmap pages only matter when allocating it and
> when freeing it.
> The only moment we would have to worry about is when we have to get rid of them
> (aka: the memory range that describe is being hot-removed), but the hpage
> vmemmap patchset makes sure that by the time we return from dissolve_free_huge_page
> the vemmap pages look as they did before the remapping.

Aehm I don't think that's the case. there might be a vmemmap but these
are then other pages than the altmap.

> 
> But as I said, we do not have to worry about this one as that feature only works
> when populating with base pages.

I'm confused.

1. Assume we hotplug memory, online it to ZONE_MOVABLE. The vmemmap gets
allocated from altmap space.

2. Allocate huge pages. The vmemmap gets freed, which ends up freeing
pages in the altmap space that are even PageReserved() and not even
accounted properly in managed pages. Problem?

3. Try offlining memory, skip the altmap space. The altmap pages might
now be used for something else in the meantime. Problem?

This should hold with both, basepages and hugepages in the vmemmap. What
am I missing?

[...]

>> IIRC, it could happen that if we add two consecutive memory blocks, that
>> the second one might reuse parts of the vmemmap residing on the first
>> memory block. If you remove the first one, you might be in trouble.
>>
>> E.g., on x86-64
>>  vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
>> - Populate a huge page
>>
>> vmemmap_free()->remove_pagetable()...->remove_pmd_table():
>> - memchr_inv() will leave the hugepage populated.
>>
>> Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
>> Or do we somehow want to fix that? We should never populate partial huge
>> pages from an altmap ...
>>
>> But maybe I am missing something.
> 
> No, you are not missing anything.
> 
> I think that remove_pmd_table() should be fixed.
> vmemmap_populate_hugepages() allocates PMD_SIZE chunk and marks the PMD as
> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
> is large and 2) the chunk is not aligned, the memset() should be writing
> PAGE_INUSE for a PMD chunk.

I tried to handle such stuff (dealing with partially present vmemmap on
a PMD) a little bit smarter in the s390x code I added recently.

Anyhow, we should never ever consume memory from an altmap when it might
cover more than the added range; that really only works with anon
memory, but not with memory from the altmap space.

> 
> I do not really a see a reason why this should not happen.
> Actually, we will be leaving pagetables behind as we never get to free
> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
> 
> I plan to fix remove_pmd_table(), but for now let us not allow to use this feature
> if the size of a struct page is not 64.
> Let us keep it simply for the time being, shall we?

Fine with me, as long as we never ever accidentally allocate more from
the altmap than we actually need - the vmemmap must either span complete
PMDs or we have to fallback to PTEs for the problematic parts.

So while we require "IS_ALIGNED(nr_vmemmap_pages, pageblock_nr_pages)",
we actually require here "IS_ALIGNED(nr_vmemmap_pages, PMD_SIZE >>
PAGE_SHIFT)" IIUC. (+ alignment of the start address)

For x86-64 this holds, for arm64 as well at least for 4k base pages. I
*hope* 64k base pages never ever use PMDs (512MB) in the vmemmap when
hotplugging :D (did not check, but I assume they don't, would be
horribly ugly otherwise)

> 
>>>  	/* set above range as isolated */
>>> -	ret = start_isolate_page_range(start_pfn, end_pfn,
>>> +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
>>>  				       MIGRATE_MOVABLE,
>>>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>>
>> If buddy_start_pfn is not aligned to pageblocks,
>> start_isolate_page_range will BUG_ON(). another sign that we need to
>> check that somewhere while adding that nr_vmemmap_pages is suitable.
> 
> Will come up with a way to 1) ensure that vmemmap pages is aligned and
> 2) disallow to use the feature in case it cannot be aligned for some reason.
> 
>  
>> Let's avoid using the "memblock" terminology. Just call it
>> "get_nr_vmemmap_pages_cb" or maybe "detect_memmap_on_memory_cb"
> 
> Fine.
> 
>>> +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
>>> +	int ret = !mem->nr_vmemmap_pages;
>>> +
>>> +	if (!ret)
>>> +		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
>>> +	return ret;
>>
>> I think you can avoid "ret".
>>
>> Also, we want to stop whenever we find "mem->nr_vmemmap_pages != 0",
>> right? AFAIU, you would currently stop whenever you find
>> mem->nr_vmemmap_pages == 0", which is wrong.
>>
>> *nr_vmemmap_pages += mem->nr_vmemmap_pages;
>> /* Stop whenever we find any sign of MHP_MEMMAP_ON_MEMORY. */
>> return mem->nr_vmemmap_pages;
> 
> Yeah, true is that current logic is wrong, but we do not want to stop when
> mem->nr_vmemmap_pages == 0 either.

Yeah, as the comment said "Stop whenever we find any sign of
MHP_MEMMAP_ON_MEMORY", whcih is the case with "mem->nr_vmemmap_pages", no?

> Since we want to avoid people to remove with a different granularity than it was
> added, we need to check the whole memory range that is going to be hot-removed
> and check for mem->nr_vmemmap_pages to make sure that the range does not
> have any self-hosted vmemmap pages.
> 
> E.g:
> 
> [memory_block not using vmemmap] memblock#0
> [memory block using vmemmap]	 memblock#1
> 
> So, if someone tries to remove memblock#0 and memblock#1 and once, we need
> to check whether any of the memblocks used self-hosted vmemmap, and if so,
> scream out loud if we are trying to remove site != memblock_size.

That's what my suggestion does, no?


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-25 10:56       ` Oscar Salvador
@ 2021-01-25 11:02         ` David Hildenbrand
  2021-01-25 13:36           ` Oscar Salvador
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2021-01-25 11:02 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 25.01.21 11:56, Oscar Salvador wrote:
> On Mon, Jan 25, 2021 at 11:39:55AM +0100, Oscar Salvador wrote:
>>> Interresting, so we automatically support differeing sizeof(struct
>>> page). I guess it will be problematic in case of sizeof(struct page) !=
>>> 64, because then, we might not have multiples of 2MB for the memmap of a
>>> memory block.
>>>
>>> IIRC, it could happen that if we add two consecutive memory blocks, that
>>> the second one might reuse parts of the vmemmap residing on the first
>>> memory block. If you remove the first one, you might be in trouble.
>>>
>>> E.g., on x86-64
>>>  vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
>>> - Populate a huge page
>>>
>>> vmemmap_free()->remove_pagetable()...->remove_pmd_table():
>>> - memchr_inv() will leave the hugepage populated.
>>>
>>> Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
>>> Or do we somehow want to fix that? We should never populate partial huge
>>> pages from an altmap ...
>>>
>>> But maybe I am missing something.
>>
>> No, you are not missing anything.
>>
>> I think that remove_pmd_table() should be fixed.
>> vmemmap_populate_hugepages() allocates PMD_SIZE chunk and marks the PMD as
>> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
>> is large and 2) the chunk is not aligned, the memset() should be writing
>> PAGE_INUSE for a PMD chunk.
>>
>> I do not really a see a reason why this should not happen.
>> Actually, we will be leaving pagetables behind as we never get to free
>> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
>>
>> I plan to fix remove_pmd_table(), but for now let us not allow to use this feature
>> if the size of a struct page is not 64.
>> Let us keep it simply for the time being, shall we?
> 
> My first intention was:
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index b5a3fa4033d3..0c9756a2eb24 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1044,32 +1044,14 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  			continue;
>  
>  		if (pmd_large(*pmd)) {
> -			if (IS_ALIGNED(addr, PMD_SIZE) &&
> -			    IS_ALIGNED(next, PMD_SIZE)) {
> -				if (!direct)
> -					free_hugepage_table(pmd_page(*pmd),
> -							    altmap);
> -
> -				spin_lock(&init_mm.page_table_lock);
> -				pmd_clear(pmd);
> -				spin_unlock(&init_mm.page_table_lock);
> -				pages++;
> -			} else {
> -				/* If here, we are freeing vmemmap pages. */
> -				memset((void *)addr, PAGE_INUSE, next - addr);
> -
> -				page_addr = page_address(pmd_page(*pmd));
> -				if (!memchr_inv(page_addr, PAGE_INUSE,
> -						PMD_SIZE)) {
> -					free_hugepage_table(pmd_page(*pmd),
> -							    altmap);
> -
> -					spin_lock(&init_mm.page_table_lock);
> -					pmd_clear(pmd);
> -					spin_unlock(&init_mm.page_table_lock);
> -				}
> -			}
> +			if (!direct)
> +				free_hugepage_table(pmd_page(*pmd),
> +						    altmap);
>  
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_clear(pmd);
> +			spin_unlock(&init_mm.page_table_lock);
> +			pages++;
>  			continue;
>  		}
> 
> I did not try it out yet and this might explode badly, but AFAICS, a PMD size
> chunk is always allocated even when the section does not spand 2MB.
> E.g: sizeof(struct page) = 56.
> 
> PAGES_PER_SECTION * 64 = 2097152
> PAGES_PER_SECTION * 56 = 1835008
> 
> Even in the latter case, vmemmap_populate_hugepages will allocate a PMD size chunk
> to satisfy the unaligned range.
> So, unless I am missing something, it should not be a problem to free that 2MB in
> remove_pmd_table when 1) the PMD is large and 2) the range is not aligned.

Assume you have two consecutive memory blocks with 56 sizeof(struct page).
The first one allocates a PMD (2097152) but only consumes 1835008, the second
one reuses the remaining part and allocates another PMD (1835008),
only using parts of it.

Ripping out a memory block, along with the PMD in the vmemmap would
remove parts of the vmemmap of another memory block.

You might want to take a look at:

commit 2c114df071935762ffa88144cdab03d84beaa702
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Jul 22 11:45:58 2020 +0200

    s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive sections
    
    Let's avoid memset(PAGE_UNUSED) when adding consecutive sections,
    whereby the vmemmap of a single section does not span full PMDs.
    
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Message-Id: <20200722094558.9828-10-david@redhat.com>
    Signed-off-by: Heiko Carstens <hca@linux.ibm.com>

commit cd5781d63eaf6dbf89532d8c7c214786b767ee16
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Jul 22 11:45:57 2020 +0200

    s390/vmemmap: remember unused sub-pmd ranges
    
    With a memmap size of 56 bytes or 72 bytes per page, the memmap for a
    256 MB section won't span full PMDs. As we populate single sections and
    depopulate single sections, the depopulation step would not be able to
    free all vmemmap pmds anymore.
    
    Do it similarly to x86, marking the unused memmap ranges in a special way
    (pad it with 0xFD).
    
    This allows us to add/remove sections, cleaning up all allocated
    vmemmap pages even if the memmap size is not multiple of 16 bytes per page.
    
    A 56 byte memmap can, for example, be created with !CONFIG_MEMCG and
    !CONFIG_SLUB.
    
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Message-Id: <20200722094558.9828-9-david@redhat.com>
    Signed-off-by: Heiko Carstens <hca@linux.ibm.com>


But again, the root issue I see is that we can play such games with anonymous pages, but not with
memory residing in the altmap space.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-25 10:57       ` David Hildenbrand
@ 2021-01-25 11:18         ` Oscar Salvador
  2021-01-25 11:23           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2021-01-25 11:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Anshuman Khandual, akpm, mhocko, linux-kernel, linux-mm, vbabka,
	pasha.tatashin

On Mon, Jan 25, 2021 at 11:57:20AM +0100, David Hildenbrand wrote:
> I'm confused.
> 
> 1. Assume we hotplug memory, online it to ZONE_MOVABLE. The vmemmap gets
> allocated from altmap space.

The vmemmap could have never been allocated from altmap in case hpage vmemmap
feature is enabled.

Have a look at [1].
If is_hugetlb_free_vmemmap_enabled(), vmemmap_populate() ends up calling
vmemmap_populate_basepages().

And since no memory was consumed from altmap, and hence altmap_alloc_block_buf()
was never called, vmem_altmap->alloc will be 0, and memory_block->nr_vmemmap_pages
will be 0 as well.

But on a second though, true is that we will get in trouble if hpage vmemmap
feature ever gets to work with vmemmap_populate_hugepages.
I will queue that to look in a new future.

[1] https://patchwork.kernel.org/project/linux-mm/patch/20210117151053.24600-10-songmuchun@bytedance.com/


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-25 11:18         ` Oscar Salvador
@ 2021-01-25 11:23           ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2021-01-25 11:23 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Anshuman Khandual, akpm, mhocko, linux-kernel, linux-mm, vbabka,
	pasha.tatashin

On 25.01.21 12:18, Oscar Salvador wrote:
> On Mon, Jan 25, 2021 at 11:57:20AM +0100, David Hildenbrand wrote:
>> I'm confused.
>>
>> 1. Assume we hotplug memory, online it to ZONE_MOVABLE. The vmemmap gets
>> allocated from altmap space.
> 
> The vmemmap could have never been allocated from altmap in case hpage vmemmap
> feature is enabled.
> 
> Have a look at [1].
> If is_hugetlb_free_vmemmap_enabled(), vmemmap_populate() ends up calling
> vmemmap_populate_basepages().

Oh, it calls "vmemmap_populate_basepages(start, end, node, NULL);"

the "NULL" part is the important bit.

> 
> And since no memory was consumed from altmap, and hence altmap_alloc_block_buf()
> was never called, vmem_altmap->alloc will be 0, and memory_block->nr_vmemmap_pages
> will be 0 as well.
> 
> But on a second though, true is that we will get in trouble if hpage vmemmap
> feature ever gets to work with vmemmap_populate_hugepages.
> I will queue that to look in a new future.

This seriously needs comments and documentation. The problem is where to
document as long as one of both series is not merged yet :)

At least in the cover letter, because this is not obvious how both
things will play along.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-01-25 11:02         ` David Hildenbrand
@ 2021-01-25 13:36           ` Oscar Salvador
  0 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2021-01-25 13:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Mon, Jan 25, 2021 at 12:02:53PM +0100, David Hildenbrand wrote:
> Assume you have two consecutive memory blocks with 56 sizeof(struct page).
> The first one allocates a PMD (2097152) but only consumes 1835008, the second
> one reuses the remaining part and allocates another PMD (1835008),
> only using parts of it.
> 
> Ripping out a memory block, along with the PMD in the vmemmap would
> remove parts of the vmemmap of another memory block.

Bleh, yeah, I was confused, you are right.

> You might want to take a look at:

Thanks a lot for the hints, I will have a look ;-)


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-01-26 19:45 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-17 13:07 [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador
2020-12-17 13:07 ` [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
2021-01-11 16:52   ` David Hildenbrand
2021-01-12  7:26     ` Oscar Salvador
2021-01-12 10:12       ` David Hildenbrand
2021-01-12 11:17         ` Oscar Salvador
2021-01-12 11:17           ` David Hildenbrand
2020-12-17 13:07 ` [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
2021-01-18 10:11   ` Oscar Salvador
2021-01-19 13:58   ` David Hildenbrand
2021-01-25 10:39     ` Oscar Salvador
2021-01-25 10:56       ` Oscar Salvador
2021-01-25 11:02         ` David Hildenbrand
2021-01-25 13:36           ` Oscar Salvador
2021-01-25 10:57       ` David Hildenbrand
2021-01-25 11:18         ` Oscar Salvador
2021-01-25 11:23           ` David Hildenbrand
2020-12-17 13:07 ` [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
2021-01-11 16:53   ` David Hildenbrand
2020-12-17 13:07 ` [PATCH 4/5] powerpc/memhotplug: " Oscar Salvador
2021-01-11 16:55   ` David Hildenbrand
2021-01-12 13:31     ` Oscar Salvador
2020-12-17 13:07 ` [PATCH 5/5] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
2021-01-11  9:05 ` [PATCH 0/5] Allocate memmap from hotadded memory (per device) Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).