linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/4] Allocate memmap from hotadded memory (per device)
@ 2020-12-01 11:51 Oscar Salvador
  2020-12-01 11:51 ` [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory Oscar Salvador
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Oscar Salvador @ 2020-12-01 11:51 UTC (permalink / raw)
  To: david
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin, Oscar Salvador

This is v3 of [1]:

Changes from v2 -> v3:
 - Re-order patches (Michal)
 - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
 - Add kernel boot option to enable this feature (Michal)

Changes from v1 -> v2:
 - Addressed feedback provided by David
 - Add a arch_support_memmap_on_memory to be called
   from mhp_supports_memmap_on_memory, as atm,
   only ARM, powerpc and x86_64 have altmat support.


Original cover letter:

----

The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
The current way we use to populate memmap (struct page array) has two main drawbacks:

a) it consumes an additional memory until the hotadded memory itself is
   onlined and
b) memmap might end up on a different numa node which is especially true
   for movable_node configuration.
c) due to fragmentation we might end up populating memmap with base
   pages

One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array. See patch 3 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

[Overall design]:

Implementation wise we reuse vmem_altmap infrastructure to override
the default allocator used by vmemap_populate.
memory_block structure gained a new field called nr_vmemmap_pages.
This plays well for two reasons:

 1) {offline/online}_pages know the difference between start_pfn and
    buddy_start_pfn, which is start_pfn + nr_vmemmap_pages.
    In this way all isolation/migration operations are
    done to within the right range of memory without vmemmap pages.
    This allows us for a much cleaner handling.

 2) In try_remove_memory, we construct a new vmemap_altmap struct with the
    right information based on memory_block->nr_vmemap_pages, so we end up
    calling vmem_altmap_free instead of free_pagetable when removing the memory.

Oscar Salvador (4):
  mm,memory_hotplug: Add mhp_supports_memmap_on_memory
  mm,memory_hotplug: Allocate memmap from the added memory range
  mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  mm,memory_hotplug: Add mhp_memmap_on_memory boot option

 arch/arm64/mm/mmu.c                           |   5 +
 arch/powerpc/mm/mem.c                         |   5 +
 .../platforms/pseries/hotplug-memory.c        |   5 +-
 arch/x86/mm/init_64.c                         |   5 +
 drivers/acpi/acpi_memhotplug.c                |   5 +-
 drivers/base/memory.c                         |  21 +--
 include/linux/memory.h                        |   7 +-
 include/linux/memory_hotplug.h                |  22 ++-
 include/linux/memremap.h                      |   2 +-
 mm/memory_hotplug.c                           | 127 +++++++++++++++---
 mm/page_alloc.c                               |   4 +-
 11 files changed, 171 insertions(+), 37 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory
  2020-12-01 11:51 [RFC PATCH v3 0/4] Allocate memmap from hotadded memory (per device) Oscar Salvador
@ 2020-12-01 11:51 ` Oscar Salvador
  2020-12-02  9:36   ` David Hildenbrand
  2020-12-01 11:51 ` [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-01 11:51 UTC (permalink / raw)
  To: david
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin, Oscar Salvador

mhp_supports_memmap_on_memory is meant to be used by the caller prior
to hot-adding memory in order to figure out whether it can enable
MHP_MEMMAP_ON_MEMORY or not.

Enabling MHP_MEMMAP_ON_MEMORY requires:

 - memmap_on_memory_enabled is set (by mhp_memmap_on_memory kernel boot option)
 - CONFIG_SPARSEMEM_VMEMMAP
 - architecture support for altmap
 - hot-added range spans a single memory block

Note that mhp_memmap_on_memory kernel boot option will be added in
a coming patch.

At the moment, only three architectures support passing altmap when
building the page tables: x86, POWERPC and ARM.
Define an arch_support_memmap_on_memory function on those architectures
that returns true, and define a __weak variant of it that will be used
on the others.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/mm/mmu.c            |  5 +++++
 arch/powerpc/mm/mem.c          |  5 +++++
 arch/x86/mm/init_64.c          |  5 +++++
 include/linux/memory_hotplug.h | 10 ++++++++++
 mm/memory_hotplug.c            | 24 ++++++++++++++++++++++++
 5 files changed, 49 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 30c6dd02e706..8a33ac97dcbb 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1454,6 +1454,11 @@ static bool inside_linear_region(u64 start, u64 size)
 	       (start + size - 1) <= __pa(PAGE_END - 1);
 }
 
+bool arch_support_memmap_on_memory(void)
+{
+	return true;
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 		    struct mhp_params *params)
 {
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0694bdcce653..c5ef015c3189 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -181,6 +181,11 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(start_pfn, nr_pages, altmap);
 	arch_remove_linear_mapping(start, size);
 }
+
+bool arch_support_memmap_on_memory(void)
+{
+	return true;
+}
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..ffb9d87c77e8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -860,6 +860,11 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	return ret;
 }
 
+bool arch_support_memmap_on_memory(void)
+{
+	return true;
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 		    struct mhp_params *params)
 {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 15acce5ab106..a54310abee79 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -70,6 +70,14 @@ typedef int __bitwise mhp_t;
  */
 #define MEMHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -129,6 +137,7 @@ extern int try_online_node(int nid);
 
 extern int arch_add_memory(int nid, u64 start, u64 size,
 			   struct mhp_params *params);
+extern bool arch_support_memmap_on_memory(void);
 extern u64 max_mem_size;
 
 extern int memhp_online_type_from_str(const char *str);
@@ -361,6 +370,7 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 		unsigned long nr_pages);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a8cef4955907..e3c310225a60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1011,6 +1011,20 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+bool __weak arch_support_memmap_on_memory(void)
+{
+	return false;
+}
+
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+	if (!arch_support_memmap_on_memory() ||
+	    !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) ||
+	    size > memory_block_size_bytes())
+		return false;
+	return true;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1046,6 +1060,16 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 	new_node = ret;
 
+	/*
+	 * Return -EINVAL if caller specified MHP_MEMMAP_ON_MEMORY and we do
+	 * not support it.
+	 */
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    !mhp_supports_memmap_on_memory(size)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &params);
 	if (ret < 0)
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-01 11:51 [RFC PATCH v3 0/4] Allocate memmap from hotadded memory (per device) Oscar Salvador
  2020-12-01 11:51 ` [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory Oscar Salvador
@ 2020-12-01 11:51 ` Oscar Salvador
  2020-12-02 10:05   ` David Hildenbrand
  2020-12-01 11:51 ` [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
  2020-12-01 11:51 ` [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option Oscar Salvador
  3 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-01 11:51 UTC (permalink / raw)
  To: david
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin, Oscar Salvador

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) an existing memory is consumed for that purpose
    (eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.
 c) It might be there are no PMD_ALIGNED chunks so memmap array gets
    populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This comes in handy as in {online,offline}_pages, all the isolation and
migration is being one on (buddy_start_pfn, end_pfn] range,
being buddy_start_pfn = start_pfn + nr_vmemmap_pages.

In this way, we have:

(start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
(buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

Hot-remove:

 If the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does
 the right thing and calls vmem_altmap_free instead of
 free_pagetable.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c          | 21 +++++---
 include/linux/memory.h         |  7 ++-
 include/linux/memory_hotplug.h | 12 +++--
 include/linux/memremap.h       |  2 +-
 mm/memory_hotplug.c            | 94 ++++++++++++++++++++++++++--------
 mm/page_alloc.c                |  4 +-
 6 files changed, 105 insertions(+), 35 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 00a2f7191357..ed68fe267837 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -175,7 +175,7 @@ int memory_notify(unsigned long val, void *v)
  */
 static int
 memory_block_action(unsigned long start_section_nr, unsigned long action,
-		    int online_type, int nid)
+		    int online_type, int nid, unsigned long nr_vmemmap_pages)
 {
 	unsigned long start_pfn;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
 
 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, online_type, nid);
+		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
+				   online_type, nid);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages);
+		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
@@ -211,7 +212,7 @@ static int memory_block_change_state(struct memory_block *mem,
 		mem->state = MEM_GOING_OFFLINE;
 
 	ret = memory_block_action(mem->start_section_nr, to_state,
-				  mem->online_type, mem->nid);
+				  mem->online_type, mem->nid, mem->nr_vmemmap_pages);
 
 	mem->state = ret ? from_state_req : to_state;
 
@@ -571,7 +572,8 @@ int register_memory(struct memory_block *memory)
 	return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+			     unsigned long nr_vmemmap_pages)
 {
 	struct memory_block *mem;
 	unsigned long start_pfn;
@@ -591,6 +593,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state)
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 	mem->nid = NUMA_NO_NODE;
+	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
 	ret = register_memory(mem);
 
@@ -610,7 +613,7 @@ static int add_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -632,7 +635,8 @@ static void unregister_memory(struct memory_block *memory)
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -645,9 +649,10 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
 		if (ret)
 			break;
+		vmemmap_pages = 0;
 	}
 	if (ret) {
 		end_block_id = block_id;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 439a89e758d8..b910d2aea879 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -30,6 +30,11 @@ struct memory_block {
 	int phys_device;		/* to which fru does this belong? */
 	struct device dev;
 	int nid;			/* NID for this memory block */
+	unsigned long nr_vmemmap_pages;	/*
+					 * Number of vmemmap pages. These pages
+					 * lay at the beginning of the memory
+					 * block.
+					 */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -81,7 +86,7 @@ static inline int memory_notify(unsigned long val, void *v)
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size, unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a54310abee79..b828ddf07682 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -121,11 +121,13 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			int online_type, int nid);
+			unsigned long nr_vmemmap_pages, int online_type,
+			int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
-				     unsigned long end_pfn);
+				     unsigned long end_pfn,
+				     unsigned long buddy_start_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
@@ -321,7 +323,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			 unsigned long nr_vmemmap_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 extern int offline_and_remove_memory(int nid, u64 start, u64 size);
@@ -329,7 +332,8 @@ extern int offline_and_remove_memory(int nid, u64 start, u64 size);
 #else
 static inline void try_offline_node(int nid) {}
 
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+				unsigned long nr_vmemmap_pages)
 {
 	return -EINVAL;
 }
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..3de5d482ac1a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,7 +17,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e3c310225a60..4b4708512f82 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -606,10 +606,12 @@ void generic_online_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
-static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
+			       unsigned long buddy_start_pfn)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long pfn;
+	unsigned int order;
 
 	/*
 	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
@@ -617,8 +619,12 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 	 * later). We account all pages as being online and belonging to this
 	 * zone ("present").
 	 */
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
-		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
+	for (pfn = buddy_start_pfn; pfn < end_pfn; pfn += (1 << order)) {
+		order = MAX_ORDER - 1;
+		while (pfn & ((1 << order) - 1))
+			order--;
+		(*online_page_callback)(pfn_to_page(pfn), order);
+	}
 
 	/* mark all involved sections as online */
 	online_mem_sections(start_pfn, end_pfn);
@@ -678,6 +684,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
 	pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
 
 }
+
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
  * and resizing the pgdat/zone data to span the added pages. After this
@@ -777,9 +784,9 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 }
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       int online_type, int nid)
+		       unsigned long nr_vmemmap_pages, int online_type, int nid)
 {
-	unsigned long flags;
+	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
@@ -790,11 +797,18 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
+	if (nr_vmemmap_pages)
+		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
+				       MIGRATE_UNMOVABLE);
+	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
+			       MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
@@ -810,7 +824,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * onlining, such that undo_isolate_page_range() works correctly.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
@@ -823,7 +837,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		setup_zone_pageset(zone);
 	}
 
-	online_pages_range(pfn, nr_pages);
+	online_pages_range(pfn, nr_pages, buddy_start_pfn);
 	zone->present_pages += nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -836,7 +850,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
-	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+	undo_isolate_page_range(buddy_start_pfn,
+				buddy_start_pfn + buddy_nr_pages,
+				MIGRATE_MOVABLE);
 
 	/*
 	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1034,6 +1050,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+	struct vmem_altmap mhp_altmap = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
@@ -1070,13 +1087,22 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 	}
 
+	/*
+	 * Self hosted memmap array
+	 */
+	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
+		mhp_altmap.free = size >> PAGE_SHIFT;
+		mhp_altmap.base_pfn = start >> PAGE_SHIFT;
+		params.altmap = &mhp_altmap;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &params);
 	if (ret < 0)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
 	if (ret) {
 		arch_remove_memory(nid, start, size, NULL);
 		goto error;
@@ -1472,10 +1498,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			unsigned long nr_vmemmap_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, system_ram_pages = 0;
+	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
@@ -1487,6 +1514,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/*
@@ -1522,7 +1552,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone_pcp_disable(zone);
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
+	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
@@ -1542,7 +1572,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	}
 
 	do {
-		pfn = start_pfn;
+		pfn = buddy_start_pfn;
 		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
@@ -1573,18 +1603,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		 * offlining actually in order to make hugetlbfs's object
 		 * counting consistent.
 		 */
-		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
 		if (ret) {
 			reason = "failure to dissolve huge pages";
 			goto failed_removal_isolated;
 		}
 
-		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
+		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
 
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn);
+	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
 	pr_info("Offlined Pages %ld\n", nr_pages);
 
 	/*
@@ -1593,13 +1623,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	 * of isolated pageblocks, memory onlining will properly revert this.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	zone_pcp_enable(zone);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
 	zone->present_pages -= nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1659,6 +1689,16 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return 0;
 }
 
+static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
+	int ret = !mem->nr_vmemmap_pages;
+
+	if (!ret)
+		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
+	return ret;
+}
+
 static int check_cpu_on_node(pg_data_t *pgdat)
 {
 	int cpu;
@@ -1733,6 +1773,9 @@ EXPORT_SYMBOL(try_offline_node);
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
 	int rc = 0;
+	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap *altmap = NULL;
+	unsigned long nr_vmemmap_pages = 0;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1745,6 +1788,17 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 
+	/*
+	 * Prepare a vmem_altmap struct if we used it at hot-add, so
+	 * remove_pmd_table->free_hugepage_table does the right thing.
+	 */
+	(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
+	     get_memblock_vmemmap_pages_cb);
+	if (nr_vmemmap_pages) {
+		mhp_altmap.alloc = nr_vmemmap_pages;
+		altmap = &mhp_altmap;
+	}
+
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
@@ -1756,7 +1810,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, NULL);
+	arch_remove_memory(nid, start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f91df593bf71..818fb0223c35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8856,7 +8856,8 @@ void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone, must not contain holes,
  * must span full sections, and must be isolated before calling this function.
  */
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn,
+			      unsigned long buddy_start_pfn)
 {
 	unsigned long pfn = start_pfn;
 	struct page *page;
@@ -8867,6 +8868,7 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
+	pfn = buddy_start_pfn;
 	while (pfn < end_pfn) {
 		page = pfn_to_page(pfn);
 		/*
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-01 11:51 [RFC PATCH v3 0/4] Allocate memmap from hotadded memory (per device) Oscar Salvador
  2020-12-01 11:51 ` [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory Oscar Salvador
  2020-12-01 11:51 ` [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
@ 2020-12-01 11:51 ` Oscar Salvador
  2020-12-02  9:37   ` David Hildenbrand
  2020-12-01 11:51 ` [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option Oscar Salvador
  3 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-01 11:51 UTC (permalink / raw)
  To: david
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin, Oscar Salvador

Let the callers check whether they can use MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
Currently, we only support MHP_MEMMAP_ON_MEMORY in case
CONFIG_SPARSEMEM_VMEMMAP is enabled, the architecture supports altmap,
and the range to be added spans a single memory block.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 5 ++++-
 drivers/acpi/acpi_memhotplug.c                  | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 7efe6ec5d14a..3aba817e729e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -612,6 +612,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
 	int nid, rc;
+	mhp_t mhp_flags = MHP_NONE;
 
 	if (lmb->flags & DRCONF_MEM_ASSIGNED)
 		return -EINVAL;
@@ -629,8 +630,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	if (nid < 0 || !node_possible(nid))
 		nid = first_online_node;
 
+	if (mhp_supports_memmap_on_memory(block_sz))
+		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 	/* Add the memory */
-	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
+	rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..1fe645ef0b6c 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -172,6 +172,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 	int result, num_enabled = 0;
 	struct acpi_memory_info *info;
 	int node;
+	mhp_t mhp_flags = MHP_NONE;
 
 	node = acpi_get_node(handle);
 	/*
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
+		if (mhp_supports_memmap_on_memory(info->length))
+			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 		result = __add_memory(node, info->start_addr, info->length,
-				      MHP_NONE);
+				      mhp_flags);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option
  2020-12-01 11:51 [RFC PATCH v3 0/4] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (2 preceding siblings ...)
  2020-12-01 11:51 ` [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
@ 2020-12-01 11:51 ` Oscar Salvador
  2020-12-02  9:42   ` David Hildenbrand
  3 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-01 11:51 UTC (permalink / raw)
  To: david
  Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin, Oscar Salvador

Self stored memmap leads to a sparse memory situation which is unsuitable
for workloads that requires large contiguous memory chunks.

Make this an opt-in which needs to be explicitly enabled.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory_hotplug.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4b4708512f82..858d6161e915 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
+static bool memmap_on_memory_enabled __read_mostly = false;
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -1034,7 +1036,7 @@ bool __weak arch_support_memmap_on_memory(void)
 
 bool mhp_supports_memmap_on_memory(unsigned long size)
 {
-	if (!arch_support_memmap_on_memory() ||
+	if (!memmap_on_memory_enabled || !arch_support_memmap_on_memory() ||
 	    !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) ||
 	    size > memory_block_size_bytes())
 		return false;
@@ -1422,6 +1424,13 @@ static int __init cmdline_parse_movable_node(char *p)
 }
 early_param("movable_node", cmdline_parse_movable_node);
 
+static int __init cmdline_parse_memmap_on_memory(char *p)
+{
+	memmap_on_memory_enabled = true;
+	return 0;
+}
+early_param("mhp_memmap_on_memory", cmdline_parse_memmap_on_memory);
+
 /* check which state of node_states will be changed when offline memory */
 static void node_states_check_changes_offline(unsigned long nr_pages,
 		struct zone *zone, struct memory_notify *arg)
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory
  2020-12-01 11:51 ` [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory Oscar Salvador
@ 2020-12-02  9:36   ` David Hildenbrand
  2020-12-09  9:36     ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-12-02  9:36 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 01.12.20 12:51, Oscar Salvador wrote:
> mhp_supports_memmap_on_memory is meant to be used by the caller prior
> to hot-adding memory in order to figure out whether it can enable
> MHP_MEMMAP_ON_MEMORY or not.
> 
> Enabling MHP_MEMMAP_ON_MEMORY requires:
> 
>  - memmap_on_memory_enabled is set (by mhp_memmap_on_memory kernel boot option)
>  - CONFIG_SPARSEMEM_VMEMMAP
>  - architecture support for altmap
>  - hot-added range spans a single memory block

Instead of adding these arch callbacks, what about a config option

ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

that gets selected by the archs with CONFIG_SPARSEMEM_VMEMMAP ?

The mhp_supports_memmap_on_memory() becomes even more trivial.


> 
> Note that mhp_memmap_on_memory kernel boot option will be added in
> a coming patch.

I think it makes sense to

a) separate off the arch changes into separate patches, clarifying why
it can be used. Move this patches to the end of the series.

b) Squashing the remainings into patch #2

> 
> At the moment, only three architectures support passing altmap when
> building the page tables: x86, POWERPC and ARM.
> Define an arch_support_memmap_on_memory function on those architectures
> that returns true, and define a __weak variant of it that will be used
> on the others.

[...]

> +/*
> + * We want memmap (struct page array) to be self contained.
> + * To do so, we will use the beginning of the hot-added range to build
> + * the page tables for the memmap array that describes the entire range.
> + * Only selected architectures support it with SPARSE_VMEMMAP.

You might want to add how the caller can calculate the necessary size
and that that this calculated piece of memory to be added will be
accessed before onlining these pages. This is e.g., relevant if
virtio-mem, the hyper-v balloon, or xen balloon would want to use this
mechanism. Also, it's somewhat incompatible with standby memory where
memory cannot be accessed prior to onlining. So pointing that access out
might be valuable.

> + */
> +#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
> +
>  /*
>   * Extended parameters for memory hotplug:
>   * altmap: alternative allocator for memmap array (optional)
> @@ -129,6 +137,7 @@ extern int try_online_node(int nid);
>  
>  extern int arch_add_memory(int nid, u64 start, u64 size,
>  			   struct mhp_params *params);
> +extern bool arch_support_memmap_on_memory(void);
>  extern u64 max_mem_size;
>  
>  extern int memhp_online_type_from_str(const char *str);
> @@ -361,6 +370,7 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
>  					  unsigned long pnum);
>  extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
>  		unsigned long nr_pages);
> +extern bool mhp_supports_memmap_on_memory(unsigned long size);
>  extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>  				      struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a8cef4955907..e3c310225a60 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1011,6 +1011,20 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  	return device_online(&mem->dev);
>  }
>  
> +bool __weak arch_support_memmap_on_memory(void)
> +{
> +	return false;
> +}
> +
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	if (!arch_support_memmap_on_memory() ||
> +	    !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) ||
> +	    size > memory_block_size_bytes())
> +		return false;
> +	return true;

You can simplify to

return arch_support_memmap_on_memory() &&
       IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) &&
       size == memory_block_size_bytes();


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-01 11:51 ` [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
@ 2020-12-02  9:37   ` David Hildenbrand
  2020-12-09  9:38     ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-12-02  9:37 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 01.12.20 12:51, Oscar Salvador wrote:
> Let the callers check whether they can use MHP_MEMMAP_ON_MEMORY by
> checking mhp_supports_memmap_on_memory().
> Currently, we only support MHP_MEMMAP_ON_MEMORY in case
> CONFIG_SPARSEMEM_VMEMMAP is enabled, the architecture supports altmap,
> and the range to be added spans a single memory block.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 5 ++++-
>  drivers/acpi/acpi_memhotplug.c                  | 5 ++++-
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 7efe6ec5d14a..3aba817e729e 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -612,6 +612,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  {
>  	unsigned long block_sz;
>  	int nid, rc;
> +	mhp_t mhp_flags = MHP_NONE;
>  
>  	if (lmb->flags & DRCONF_MEM_ASSIGNED)
>  		return -EINVAL;
> @@ -629,8 +630,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  	if (nid < 0 || !node_possible(nid))
>  		nid = first_online_node;
>  
> +	if (mhp_supports_memmap_on_memory(block_sz))
> +		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>  	/* Add the memory */
> -	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
> +	rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
>  	if (rc) {
>  		invalidate_lmb_associativity_index(lmb);
>  		return rc;
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index b02fd51e5589..1fe645ef0b6c 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -172,6 +172,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  	int result, num_enabled = 0;
>  	struct acpi_memory_info *info;
>  	int node;
> +	mhp_t mhp_flags = MHP_NONE;
>  
>  	node = acpi_get_node(handle);
>  	/*
> @@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  		if (node < 0)
>  			node = memory_add_physaddr_to_nid(info->start_addr);
>  
> +		if (mhp_supports_memmap_on_memory(info->length))
> +			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>  		result = __add_memory(node, info->start_addr, info->length,
> -				      MHP_NONE);
> +				      mhp_flags);
>  
>  		/*
>  		 * If the memory block has been used by the kernel, add_memory()
> 

Please split that patch into two parts, one for each subsystem.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option
  2020-12-01 11:51 ` [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option Oscar Salvador
@ 2020-12-02  9:42   ` David Hildenbrand
  2020-12-09 10:02     ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-12-02  9:42 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 01.12.20 12:51, Oscar Salvador wrote:
> Self stored memmap leads to a sparse memory situation which is unsuitable
> for workloads that requires large contiguous memory chunks.
> 
> Make this an opt-in which needs to be explicitly enabled.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/memory_hotplug.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4b4708512f82..858d6161e915 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +static bool memmap_on_memory_enabled __read_mostly = false;
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -1034,7 +1036,7 @@ bool __weak arch_support_memmap_on_memory(void)
>  
>  bool mhp_supports_memmap_on_memory(unsigned long size)
>  {
> -	if (!arch_support_memmap_on_memory() ||
> +	if (!memmap_on_memory_enabled || !arch_support_memmap_on_memory() ||
>  	    !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) ||
>  	    size > memory_block_size_bytes())
>  		return false;
> @@ -1422,6 +1424,13 @@ static int __init cmdline_parse_movable_node(char *p)
>  }
>  early_param("movable_node", cmdline_parse_movable_node);
>  
> +static int __init cmdline_parse_memmap_on_memory(char *p)
> +{
> +	memmap_on_memory_enabled = true;
> +	return 0;
> +}
> +early_param("mhp_memmap_on_memory", cmdline_parse_memmap_on_memory);
> +
>  /* check which state of node_states will be changed when offline memory */
>  static void node_states_check_changes_offline(unsigned long nr_pages,
>  		struct zone *zone, struct memory_notify *arg)
> 

I have another memhp tunable in the works. I suggest doing it like
page_shuffling and using, module parameters instead. Makes this
a bit nicer IMHO.


diff --git a/mm/Makefile b/mm/Makefile
index 069f216e109e..ba7714b5eaa1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y                        := filemap.o mempool.o oom_kill.o fadvise.o \
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
 
+# Give "memory_hotplug" its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) := memory_hotplug.o
+
 obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
 
 ifdef CONFIG_MMU
        obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
@@ -82,7 +86,6 @@ obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)    += kasan/
 obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)          += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o


The you can just use module_param/MODULE_PARM_DESC and set the parameter via

"memory_hotplug.memmap_on_memory"

-- 
Thanks,

David / dhildenb



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-01 11:51 ` [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
@ 2020-12-02 10:05   ` David Hildenbrand
  2020-12-09 10:32     ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-12-02 10:05 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 01.12.20 12:51, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
>     (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
>     which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>     populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being one on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

If you take a look at generic_online_page() there are some things that
won't be done for our vmemmap pages

1. kernel_map_pages(page, 1 << order, 1);

We're accessing these pages already when initializing the memmap. We
might have to explicitly map these vmemmap pages at some point. Might
require some thought. Did you test with debug pagealloc?

2. totalram_pages_add(1UL << order);

We should add/remove the vmemmap pages manually from totalram I guess.


Also, I wonder if we should add these pages to the zone managed page
count. Not quite sure yet what the benefit would be and if we care at
all. (it's a change in behavior when hotplugging memory, though, that's
why I am poining it out)

> 
> Hot-remove:
> 
>  If the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does
>  the right thing and calls vmem_altmap_free instead of
>  free_pagetable.
> 

You might want to update the comment in include/linux/mmzone.h for
ZONE_MOVABLE, adding this special case when onlining such memory to the
MOVABLE zone. The pages are movable for offlining only, not for allocations.

[...]
> -int create_memory_block_devices(unsigned long start, unsigned long size)
> +int create_memory_block_devices(unsigned long start, unsigned long size,
> +				unsigned long vmemmap_pages)
>  {
>  	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
>  	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
> @@ -645,9 +649,10 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
>  		return -EINVAL;
>  
>  	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
> -		ret = init_memory_block(block_id, MEM_OFFLINE);
> +		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
>  		if (ret)
>  			break;
> +		vmemmap_pages = 0;

We should even bail out if someone would have vmemmap_pages set when
creating more than one memory block device.

>  	}
>  	if (ret) {
>  		end_block_id = block_id;
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 439a89e758d8..b910d2aea879 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -30,6 +30,11 @@ struct memory_block {
>  	int phys_device;		/* to which fru does this belong? */
>  	struct device dev;
>  	int nid;			/* NID for this memory block */
> +	unsigned long nr_vmemmap_pages;	/*
> +					 * Number of vmemmap pages. These pages
> +					 * lay at the beginning of the memory
> +					 * block.
> +					 */
>  };
>  
>  int arch_get_memory_phys_device(unsigned long start_pfn);
> @@ -81,7 +86,7 @@ static inline int memory_notify(unsigned long val, void *v)
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> -int create_memory_block_devices(unsigned long start, unsigned long size);
> +int create_memory_block_devices(unsigned long start, unsigned long size, unsigned long vmemmap_pages);
>  void remove_memory_block_devices(unsigned long start, unsigned long size);
>  extern void memory_dev_init(void);
>  extern int memory_notify(unsigned long val, void *v);
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index a54310abee79..b828ddf07682 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -121,11 +121,13 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
>  extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
>  /* VM interface that may be used by firmware interface */
>  extern int online_pages(unsigned long pfn, unsigned long nr_pages,
> -			int online_type, int nid);
> +			unsigned long nr_vmemmap_pages, int online_type,
> +			int nid);
>  extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
>  					 unsigned long end_pfn);
>  extern void __offline_isolated_pages(unsigned long start_pfn,
> -				     unsigned long end_pfn);
> +				     unsigned long end_pfn,
> +				     unsigned long buddy_start_pfn);
>  
>  typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
>  
> @@ -321,7 +323,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  
>  extern void try_offline_node(int nid);
> -extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> +extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			 unsigned long nr_vmemmap_pages);
>  extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
>  extern int offline_and_remove_memory(int nid, u64 start, u64 size);
> @@ -329,7 +332,8 @@ extern int offline_and_remove_memory(int nid, u64 start, u64 size);
>  #else
>  static inline void try_offline_node(int nid) {}
>  
> -static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +				unsigned long nr_vmemmap_pages)
>  {
>  	return -EINVAL;
>  }
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 86c6c368ce9b..3de5d482ac1a 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -17,7 +17,7 @@ struct device;
>   * @alloc: track pages consumed, private to vmemmap_populate()
>   */
>  struct vmem_altmap {
> -	const unsigned long base_pfn;
> +	unsigned long base_pfn;
>  	const unsigned long end_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e3c310225a60..4b4708512f82 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -606,10 +606,12 @@ void generic_online_page(struct page *page, unsigned int order)
>  }
>  EXPORT_SYMBOL_GPL(generic_online_page);
>  
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long buddy_start_pfn)
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
>  	unsigned long pfn;
> +	unsigned int order;
>  
>  	/*
>  	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
> @@ -617,8 +619,12 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
>  	 * later). We account all pages as being online and belonging to this
>  	 * zone ("present").
>  	 */
> -	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> -		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> +	for (pfn = buddy_start_pfn; pfn < end_pfn; pfn += (1 << order)) {
> +		order = MAX_ORDER - 1;
> +		while (pfn & ((1 << order) - 1))
> +			order--;
> +		(*online_page_callback)(pfn_to_page(pfn), order);
> +	}

I think it would be nicer to handle the unaligned part in a separate loop.

>  
>  	/* mark all involved sections as online */
>  	online_mem_sections(start_pfn, end_pfn);
> @@ -678,6 +684,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
>  	pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
>  
>  }
> +

Unrelated change.
[...]

> +static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> +{
> +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +	int ret = !mem->nr_vmemmap_pages;
> +
> +	if (!ret)
> +		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> +	return ret;

When removing in the same granularity than adding, all we care about it
the first memory block.


unsigned long *nr_vmemmap_pages = (unsigned long *)arg;

*nr_vmemmap_pages = mem->nr_vmemmap_pages;
/* We only care about the first memory block. */
return 1;


Removing in different granularity doesn't work at all.

> +}
> +
>  static int check_cpu_on_node(pg_data_t *pgdat)
>  {
>  	int cpu;
> @@ -1733,6 +1773,9 @@ EXPORT_SYMBOL(try_offline_node);
>  static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  {
>  	int rc = 0;
> +	struct vmem_altmap mhp_altmap = {};
> +	struct vmem_altmap *altmap = NULL;
> +	unsigned long nr_vmemmap_pages = 0;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -1745,6 +1788,17 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  	if (rc)
>  		return rc;
>  
> +	/*
> +	 * Prepare a vmem_altmap struct if we used it at hot-add, so
> +	 * remove_pmd_table->free_hugepage_table does the right thing.
> +	 */
> +	(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
> +	     get_memblock_vmemmap_pages_cb);
> +	if (nr_vmemmap_pages) {
> +		mhp_altmap.alloc = nr_vmemmap_pages;
> +		altmap = &mhp_altmap;
> +	}
> +

If someone would remove_memory() in a different granularity than
add_memory(), this would no longer work. How can we catch that
efficiently? Or at least document that this is not supported with
memmap_on_memory).


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory
  2020-12-02  9:36   ` David Hildenbrand
@ 2020-12-09  9:36     ` Oscar Salvador
  2020-12-09  9:40       ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-09  9:36 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Wed, Dec 02, 2020 at 10:36:54AM +0100, David Hildenbrand wrote:
> Instead of adding these arch callbacks, what about a config option
> 
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> 
> that gets selected by the archs with CONFIG_SPARSEMEM_VMEMMAP ?
> 
> The mhp_supports_memmap_on_memory() becomes even more trivial.

I think that would not be enough.
E.g: s390x supports CONFIG_SPARSEMEM_VMEMMAP but it does not support
altmap (and maybe other arches I did not check too).
That is why I was careful in choosing the ones that a) supports
CONFIG_SPARSEMEM_VMEMMAP and b) support altmap

> > Note that mhp_memmap_on_memory kernel boot option will be added in
> > a coming patch.
> 
> I think it makes sense to
> 
> a) separate off the arch changes into separate patches, clarifying why
> it can be used. Move this patches to the end of the series.
> 
> b) Squashing the remainings into patch #2

Ok, I can do that.

> > +/*
> > + * We want memmap (struct page array) to be self contained.
> > + * To do so, we will use the beginning of the hot-added range to build
> > + * the page tables for the memmap array that describes the entire range.
> > + * Only selected architectures support it with SPARSE_VMEMMAP.
> 
> You might want to add how the caller can calculate the necessary size
> and that that this calculated piece of memory to be added will be
> accessed before onlining these pages. This is e.g., relevant if
> virtio-mem, the hyper-v balloon, or xen balloon would want to use this
> mechanism. Also, it's somewhat incompatible with standby memory where
> memory cannot be accessed prior to onlining. So pointing that access out
> might be valuable.

Sure, I will be more verbose.

> You can simplify to
> 
> return arch_support_memmap_on_memory() &&
>        IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) &&
>        size == memory_block_size_bytes();

Yeah, thanks ;-)

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2020-12-02  9:37   ` David Hildenbrand
@ 2020-12-09  9:38     ` Oscar Salvador
  0 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2020-12-09  9:38 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Wed, Dec 02, 2020 at 10:37:23AM +0100, David Hildenbrand wrote:
 
> Please split that patch into two parts, one for each subsystem.

I did not feel the need but if it eases the review, why not :-)


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory
  2020-12-09  9:36     ` Oscar Salvador
@ 2020-12-09  9:40       ` David Hildenbrand
  2020-12-09  9:43         ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-12-09  9:40 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 09.12.20 10:36, Oscar Salvador wrote:
> On Wed, Dec 02, 2020 at 10:36:54AM +0100, David Hildenbrand wrote:
>> Instead of adding these arch callbacks, what about a config option
>>
>> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>>
>> that gets selected by the archs with CONFIG_SPARSEMEM_VMEMMAP ?
>>
>> The mhp_supports_memmap_on_memory() becomes even more trivial.
> 
> I think that would not be enough.
> E.g: s390x supports CONFIG_SPARSEMEM_VMEMMAP but it does not support
> altmap (and maybe other arches I did not check too).
> That is why I was careful in choosing the ones that a) supports
> CONFIG_SPARSEMEM_VMEMMAP and b) support altmap

Sorry if I was unclear, s390x will simply not set
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory
  2020-12-09  9:40       ` David Hildenbrand
@ 2020-12-09  9:43         ` Oscar Salvador
  0 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2020-12-09  9:43 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Wed, Dec 09, 2020 at 10:40:13AM +0100, David Hildenbrand wrote:
> Sorry if I was unclear, s390x will simply not set
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.

Bleh, that makes sense now.
I'm in a monday.. 

Thanks David

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option
  2020-12-02  9:42   ` David Hildenbrand
@ 2020-12-09 10:02     ` Oscar Salvador
  2020-12-09 10:04       ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-09 10:02 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Wed, Dec 02, 2020 at 10:42:18AM +0100, David Hildenbrand wrote:
> I have another memhp tunable in the works. I suggest doing it like
> page_shuffling and using, module parameters instead. Makes this
> a bit nicer IMHO.

Does that have any impact?

> diff --git a/mm/Makefile b/mm/Makefile
> index 069f216e109e..ba7714b5eaa1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -58,9 +58,13 @@ obj-y                        := filemap.o mempool.o oom_kill.o fadvise.o \
>  page-alloc-y := page_alloc.o
>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>  
> +# Give "memory_hotplug" its own module-parameter namespace
> +memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) := memory_hotplug.o
> +
>  obj-y += page-alloc.o
>  obj-y += init-mm.o
>  obj-y += memblock.o
> +obj-y += $(memory-hotplug-y)
>  
>  ifdef CONFIG_MMU
>         obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
> @@ -82,7 +86,6 @@ obj-$(CONFIG_SLAB) += slab.o
>  obj-$(CONFIG_SLUB) += slub.o
>  obj-$(CONFIG_KASAN)    += kasan/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
> -obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
>  obj-$(CONFIG_MEMTEST)          += memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> 
> 
> The you can just use module_param/MODULE_PARM_DESC and set the parameter via
> 
> "memory_hotplug.memmap_on_memory"

I have to confess that I was not aware of this trick, but looks cleaner
overall.

Thanks

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option
  2020-12-09 10:02     ` Oscar Salvador
@ 2020-12-09 10:04       ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-12-09 10:04 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On 09.12.20 11:02, Oscar Salvador wrote:
> On Wed, Dec 02, 2020 at 10:42:18AM +0100, David Hildenbrand wrote:
>> I have another memhp tunable in the works. I suggest doing it like
>> page_shuffling and using, module parameters instead. Makes this
>> a bit nicer IMHO.
> 
> Does that have any impact?

Not that I am aware of for our use case. You can inspect parameters via

/sys/modules/memory_hotplug/parameters/

then, and even change them (if allowed for a specific module parameters)

> 
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 069f216e109e..ba7714b5eaa1 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -58,9 +58,13 @@ obj-y                        := filemap.o mempool.o oom_kill.o fadvise.o \
>>  page-alloc-y := page_alloc.o
>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>>  
>> +# Give "memory_hotplug" its own module-parameter namespace
>> +memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) := memory_hotplug.o
>> +
>>  obj-y += page-alloc.o
>>  obj-y += init-mm.o
>>  obj-y += memblock.o
>> +obj-y += $(memory-hotplug-y)
>>  
>>  ifdef CONFIG_MMU
>>         obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
>> @@ -82,7 +86,6 @@ obj-$(CONFIG_SLAB) += slab.o
>>  obj-$(CONFIG_SLUB) += slub.o
>>  obj-$(CONFIG_KASAN)    += kasan/
>>  obj-$(CONFIG_FAILSLAB) += failslab.o
>> -obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
>>  obj-$(CONFIG_MEMTEST)          += memtest.o
>>  obj-$(CONFIG_MIGRATION) += migrate.o
>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>
>>
>> The you can just use module_param/MODULE_PARM_DESC and set the parameter via
>>
>> "memory_hotplug.memmap_on_memory"
> 
> I have to confess that I was not aware of this trick, but looks cleaner
> overall.

Me neither before I spotted the page_alloc page_shuffling usage :)


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-02 10:05   ` David Hildenbrand
@ 2020-12-09 10:32     ` Oscar Salvador
  2020-12-09 11:51       ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-12-09 10:32 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Wed, Dec 02, 2020 at 11:05:53AM +0100, David Hildenbrand wrote:
> If you take a look at generic_online_page() there are some things that
> won't be done for our vmemmap pages
> 
> 1. kernel_map_pages(page, 1 << order, 1);
> 
> We're accessing these pages already when initializing the memmap. We
> might have to explicitly map these vmemmap pages at some point. Might
> require some thought. Did you test with debug pagealloc?

I always try to run with all debug stuff enabled, but I definitely
did not enable debug_pagealloc.
I will have a look at it.

> 2. totalram_pages_add(1UL << order);
> 
> We should add/remove the vmemmap pages manually from totalram I guess.

Yes, we should. That was a clear oversight.

> Also, I wonder if we should add these pages to the zone managed page
> count. Not quite sure yet what the benefit would be and if we care at
> all. (it's a change in behavior when hotplugging memory, though, that's
> why I am poining it out)

I would rather not since they are not "manageable" per se.
The same we do not account the pages used for bootmem stuff into zone->managed.

> > Hot-remove:
> > 
> >  If the range was using memmap on memory (aka vmemmap pages),
> >  we construct an altmap structure so free_hugepage_table does
> >  the right thing and calls vmem_altmap_free instead of
> >  free_pagetable.
> > 
> 
> You might want to update the comment in include/linux/mmzone.h for
> ZONE_MOVABLE, adding this special case when onlining such memory to the
> MOVABLE zone. The pages are movable for offlining only, not for allocations.

Sure, let us document that.

> > -int create_memory_block_devices(unsigned long start, unsigned long size)
> > +int create_memory_block_devices(unsigned long start, unsigned long size,
> > +				unsigned long vmemmap_pages)
> >  {
> >  	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
> >  	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
> > @@ -645,9 +649,10 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
> >  		return -EINVAL;
> >  
> >  	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
> > -		ret = init_memory_block(block_id, MEM_OFFLINE);
> > +		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
> >  		if (ret)
> >  			break;
> > +		vmemmap_pages = 0;
> 
> We should even bail out if someone would have vmemmap_pages set when
> creating more than one memory block device.

That cannot really happen as we would have bailed out earlier prior to
call create_memory_block_devices, by the checks in mhp_supports_memmap_on_memory.

> > -	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> > -		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> > +	for (pfn = buddy_start_pfn; pfn < end_pfn; pfn += (1 << order)) {
> > +		order = MAX_ORDER - 1;
> > +		while (pfn & ((1 << order) - 1))
> > +			order--;
> > +		(*online_page_callback)(pfn_to_page(pfn), order);
> > +	}
> 
> I think it would be nicer to handle the unaligned part in a separate loop.

No strong opinion here, I can certainly do it.

> >  	/* mark all involved sections as online */
> >  	online_mem_sections(start_pfn, end_pfn);
> > @@ -678,6 +684,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
> >  	pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
> >  
> >  }
> > +
> 
> Unrelated change.

Ups ^^

> > +static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> > +{
> > +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> > +	int ret = !mem->nr_vmemmap_pages;
> > +
> > +	if (!ret)
> > +		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> > +	return ret;
> 
> When removing in the same granularity than adding, all we care about it
> the first memory block.
> 
> 
> unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> 
> *nr_vmemmap_pages = mem->nr_vmemmap_pages;
> /* We only care about the first memory block. */
> return 1;

Yeah, cleaner, thanks

> > +	/*
> > +	 * Prepare a vmem_altmap struct if we used it at hot-add, so
> > +	 * remove_pmd_table->free_hugepage_table does the right thing.
> > +	 */
> > +	(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
> > +	     get_memblock_vmemmap_pages_cb);
> > +	if (nr_vmemmap_pages) {
> > +		mhp_altmap.alloc = nr_vmemmap_pages;
> > +		altmap = &mhp_altmap;
> > +	}
> > +
> 
> If someone would remove_memory() in a different granularity than
> add_memory(), this would no longer work. How can we catch that
> efficiently? Or at least document that this is not supported with
> memmap_on_memory).

Well, we can check whether the size spans more than a single memory block.
And if it does, we can print a warning with pr_warn and refuse to remove
the memory.

We could document that "memory_hotplug.memmap_on_memory" is meant to operate
with ranges that span a single memory block.
And I guess that we could place that documentation under [1].

[1] Documentation/admin-guide/kernel-parameters.txt

Thanks for the review David

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range
  2020-12-09 10:32     ` Oscar Salvador
@ 2020-12-09 11:51       ` Oscar Salvador
  0 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2020-12-09 11:51 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: mhocko, linux-kernel, linux-mm, vbabka, pasha.tatashin

On Wed, Dec 09, 2020 at 11:32:26AM +0100, Oscar Salvador wrote:
> On Wed, Dec 02, 2020 at 11:05:53AM +0100, David Hildenbrand wrote:
> > If you take a look at generic_online_page() there are some things that
> > won't be done for our vmemmap pages
> > 
> > 1. kernel_map_pages(page, 1 << order, 1);
> > 
> > We're accessing these pages already when initializing the memmap. We
> > might have to explicitly map these vmemmap pages at some point. Might
> > require some thought. Did you test with debug pagealloc?
> 
> I always try to run with all debug stuff enabled, but I definitely
> did not enable debug_pagealloc.
> I will have a look at it.
> 
> > 2. totalram_pages_add(1UL << order);
> > 
> > We should add/remove the vmemmap pages manually from totalram I guess.
> 
> Yes, we should. That was a clear oversight.

Looking closer, I do not think we have to account those into totalram.
I might be mistaken but looking at memblock_free_all, it seems we only
account to totalram_pages those ranges laying in memblock.memory filtering
out memblock.reserved.

And it seems that the pages we use for pglist_data structs (the ones we handle
in register_page_bootmem_info_node) fall in memblock.reserved.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-12-09 11:51 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-01 11:51 [RFC PATCH v3 0/4] Allocate memmap from hotadded memory (per device) Oscar Salvador
2020-12-01 11:51 ` [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory Oscar Salvador
2020-12-02  9:36   ` David Hildenbrand
2020-12-09  9:36     ` Oscar Salvador
2020-12-09  9:40       ` David Hildenbrand
2020-12-09  9:43         ` Oscar Salvador
2020-12-01 11:51 ` [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
2020-12-02 10:05   ` David Hildenbrand
2020-12-09 10:32     ` Oscar Salvador
2020-12-09 11:51       ` Oscar Salvador
2020-12-01 11:51 ` [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
2020-12-02  9:37   ` David Hildenbrand
2020-12-09  9:38     ` Oscar Salvador
2020-12-01 11:51 ` [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option Oscar Salvador
2020-12-02  9:42   ` David Hildenbrand
2020-12-09 10:02     ` Oscar Salvador
2020-12-09 10:04       ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).