linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] Allocate memmap from hotadded memory (per device)
@ 2021-02-09 13:38 Oscar Salvador
  2021-02-09 13:38 ` [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Hi,

here is v2.

Changes from v1 -> v2
 - Addressed feedback from David
 - Fence off the feature in case struct page size is not
   multiple of PMD size or pageblock alignment cannot be guaranted
 - Tested on x86_64 small and large memory_blocks
 - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
   my VM with 16KB page size).

 Arm64 with 4KB page size behaves like x86_64 after [1], which made section
 size smaller.
 With 64KB, the feature gets fenced off due to pageblock alignment.

Changes from RFCv3 -> v1:
 - Addressed feedback from David
 - Re-order patches

Changes from v2 -> v3 (RFC):
 - Re-order patches (Michal)
 - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
 - Add kernel boot option to enable this feature (Michal)

Changes from v1 -> v2 (RFC):
 - Addressed feedback provided by David
 - Add a arch_support_memmap_on_memory to be called
   from mhp_supports_memmap_on_memory, as atm,
   only ARM, powerpc and x86_64 have altmat support.

[1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraja@codeaurora.org/

Original cover letter:

----

The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
The current way we use to populate memmap (struct page array) has two main drawbacks:

a) it consumes an additional memory until the hotadded memory itself is
   onlined and
b) memmap might end up on a different numa node which is especially true
   for movable_node configuration.
c) due to fragmentation we might end up populating memmap with base
   pages

One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array. See patch 3 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

[Overall design]:

Implementation wise we reuse vmem_altmap infrastructure to override
the default allocator used by vmemap_populate.
memory_block structure gained a new field called nr_vmemmap_pages.
This plays well for two reasons:

 1) {offline/online}_pages know the difference between start_pfn and
    buddy_start_pfn, which is start_pfn + nr_vmemmap_pages.
    In this way all isolation/migration operations are
    done to within the right range of memory without vmemmap pages.
    This allows us for a much cleaner handling.

 2) In try_remove_memory, we construct a new vmemap_altmap struct with the
    right information based on memory_block->nr_vmemap_pages, so we end up
    calling vmem_altmap_free instead of free_pagetable when removing the memory.

Oscar Salvador (7):
  mm,memory_hotplug: Allocate memmap from the added memory range
  acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  mm,memory_hotplug: Enforce struct page size to be multiple of PMD
  mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory
  x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

 Documentation/admin-guide/kernel-parameters.txt |  14 +++
 arch/arm64/Kconfig                              |   4 +
 arch/x86/Kconfig                                |   4 +
 drivers/acpi/acpi_memhotplug.c                  |   5 +-
 drivers/base/memory.c                           |  20 ++--
 include/linux/memory.h                          |   8 +-
 include/linux/memory_hotplug.h                  |  21 +++-
 include/linux/memremap.h                        |   2 +-
 include/linux/mmzone.h                          |   5 +
 mm/Kconfig                                      |   3 +
 mm/Makefile                                     |   5 +-
 mm/memory_hotplug.c                             | 150 ++++++++++++++++++++----
 mm/page_alloc.c                                 |   4 +-
 13 files changed, 208 insertions(+), 37 deletions(-)

-- 
2.16.3



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-25 18:58   ` David Hildenbrand
  2021-02-09 13:38 ` [PATCH v2 2/7] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) an existing memory is consumed for that purpose
    (eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.
 c) It might be there are no PMD_ALIGNED chunks so memmap array gets
    populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This comes in handy as in {online,offline}_pages, all the isolation and
migration is being done on (buddy_start_pfn, end_pfn] range,
being buddy_start_pfn = start_pfn + nr_vmemmap_pages.

In this way, we have:

(start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
(buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c          |  20 ++++---
 include/linux/memory.h         |   8 ++-
 include/linux/memory_hotplug.h |  21 +++++--
 include/linux/memremap.h       |   2 +-
 include/linux/mmzone.h         |   5 ++
 mm/Kconfig                     |   3 +
 mm/memory_hotplug.c            | 121 ++++++++++++++++++++++++++++++++++-------
 mm/page_alloc.c                |   4 +-
 8 files changed, 149 insertions(+), 35 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..5ea2b3fbce02 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -175,7 +175,7 @@ int memory_notify(unsigned long val, void *v)
  */
 static int
 memory_block_action(unsigned long start_section_nr, unsigned long action,
-		    int online_type, int nid)
+		    int online_type, int nid, unsigned long nr_vmemmap_pages)
 {
 	unsigned long start_pfn;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
 
 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, online_type, nid);
+		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
+				   online_type, nid);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages);
+		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
@@ -211,7 +212,7 @@ static int memory_block_change_state(struct memory_block *mem,
 		mem->state = MEM_GOING_OFFLINE;
 
 	ret = memory_block_action(mem->start_section_nr, to_state,
-				  mem->online_type, mem->nid);
+				  mem->online_type, mem->nid, mem->nr_vmemmap_pages);
 
 	mem->state = ret ? from_state_req : to_state;
 
@@ -567,7 +568,8 @@ int register_memory(struct memory_block *memory)
 	return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+			     unsigned long nr_vmemmap_pages)
 {
 	struct memory_block *mem;
 	int ret = 0;
@@ -584,6 +586,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state)
 	mem->start_section_nr = block_id * sections_per_block;
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
+	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
 	ret = register_memory(mem);
 
@@ -603,7 +606,7 @@ static int add_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -625,7 +628,8 @@ static void unregister_memory(struct memory_block *memory)
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -638,7 +642,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
 		if (ret)
 			break;
 	}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 4da95e684e20..97e92e8b556a 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -29,6 +29,11 @@ struct memory_block {
 	int online_type;		/* for passing data to online routine */
 	int nid;			/* NID for this memory block */
 	struct device dev;
+	/*
+	 * Number of vmemmap pages. These pages
+	 * lay at the beginning of the memory block.
+	 */
+	unsigned long nr_vmemmap_pages;
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -80,7 +85,8 @@ static inline int memory_notify(unsigned long val, void *v)
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7288aa5ef73b..a85d4b7d15c2 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -55,6 +55,14 @@ typedef int __bitwise mhp_t;
  */
 #define MHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -101,11 +109,13 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			int online_type, int nid);
+			unsigned long nr_vmemmap_pages, int online_type,
+			int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
-				     unsigned long end_pfn);
+				     unsigned long end_pfn,
+				     unsigned long buddy_start_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
@@ -307,7 +317,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			 unsigned long nr_vmemmap_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 extern int offline_and_remove_memory(int nid, u64 start, u64 size);
@@ -315,7 +326,8 @@ extern int offline_and_remove_memory(int nid, u64 start, u64 size);
 #else
 static inline void try_offline_node(int nid) {}
 
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+				unsigned long nr_vmemmap_pages)
 {
 	return -EINVAL;
 }
@@ -359,6 +371,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 84a9c2fe9c90..b46f63dcaed3 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,7 +17,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5e0f79a4092b..6e968923f320 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -436,6 +436,11 @@ enum zone_type {
 	 *    situations where ZERO_PAGE(0) which is allocated differently
 	 *    on different platforms may end up in a movable zone. ZERO_PAGE(0)
 	 *    cannot be migrated.
+	 * 7. Memory-hotplug: when using memmap_on_memory and onlining the memory
+	 *    to the MOVABLE zone, the vmemmap pages are also placed in such
+	 *    zone. Such pages cannot be really moved around as they are
+	 *    self-stored in the range, but they are treated as movable when
+	 *    the range they describe is about to be offlined.
 	 *
 	 * In general, no unmovable allocations that degrade memory offlining
 	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
diff --git a/mm/Kconfig b/mm/Kconfig
index ec35bf406439..761dbaecfc79 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -183,6 +183,9 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+	bool
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5ba51a8bdaeb..9bcc460e8bfe 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
+static bool memmap_on_memory_enabled;
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
-static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
+			       unsigned long buddy_start_pfn)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn;
+	unsigned long pfn = buddy_start_pfn;
+
+	/*
+	 * When using memmap_on_memory, the range might be unaligned as the
+	 * first pfns are used for vmemmap pages. Align it in case we need to.
+	 */
+	if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {
+		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
+		pfn += 1 << pageblock_order;
+	}
 
 	/*
 	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
@@ -649,7 +661,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 	 * later). We account all pages as being online and belonging to this
 	 * zone ("present").
 	 */
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
+	for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
 		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
 
 	/* mark all involved sections as online */
@@ -830,9 +842,9 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 }
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       int online_type, int nid)
+		       unsigned long nr_vmemmap_pages, int online_type, int nid)
 {
-	unsigned long flags;
+	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
@@ -843,11 +855,18 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
+	if (nr_vmemmap_pages)
+		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
+				       MIGRATE_UNMOVABLE);
+	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
+			       MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
@@ -863,7 +882,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * onlining, such that undo_isolate_page_range() works correctly.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
@@ -876,7 +895,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		setup_zone_pageset(zone);
 	}
 
-	online_pages_range(pfn, nr_pages);
+	online_pages_range(pfn, nr_pages, buddy_start_pfn);
 	zone->present_pages += nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -889,7 +908,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
-	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+	undo_isolate_page_range(buddy_start_pfn,
+				buddy_start_pfn + buddy_nr_pages,
+				MIGRATE_MOVABLE);
 
 	/*
 	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+	return memmap_on_memory_enabled &&
+	       size == memory_block_size_bytes();
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1073,6 +1100,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+	struct vmem_altmap mhp_altmap = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
@@ -1099,13 +1127,26 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 	new_node = ret;
 
+	/*
+	 * Self hosted memmap array
+	 */
+	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
+		if (!mhp_supports_memmap_on_memory(size)) {
+			ret = -EINVAL;
+			goto error;
+		}
+		mhp_altmap.free = PHYS_PFN(size);
+		mhp_altmap.base_pfn = PHYS_PFN(start);
+		params.altmap = &mhp_altmap;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &params);
 	if (ret < 0)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
 	if (ret) {
 		arch_remove_memory(nid, start, size, NULL);
 		goto error;
@@ -1563,10 +1604,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			unsigned long nr_vmemmap_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, system_ram_pages = 0;
+	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
@@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/*
@@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone_pcp_disable(zone);
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
+	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
@@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	}
 
 	do {
-		pfn = start_pfn;
+		pfn = buddy_start_pfn;
 		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
@@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		 * offlining actually in order to make hugetlbfs's object
 		 * counting consistent.
 		 */
-		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
 		if (ret) {
 			reason = "failure to dissolve huge pages";
 			goto failed_removal_isolated;
 		}
 
-		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
+		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
 
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn);
+	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
 	pr_debug("Offlined Pages %ld\n", nr_pages);
 
 	/*
@@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	 * of isolated pageblocks, memory onlining will properly revert this.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	zone_pcp_enable(zone);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
 	zone->present_pages -= nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return 0;
 }
 
+static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
+
+	*nr_vmemmap_pages += mem->nr_vmemmap_pages;
+	return mem->nr_vmemmap_pages;
+}
+
 static int check_cpu_on_node(pg_data_t *pgdat)
 {
 	int cpu;
@@ -1824,6 +1877,9 @@ EXPORT_SYMBOL(try_offline_node);
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
 	int rc = 0;
+	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap *altmap = NULL;
+	unsigned long nr_vmemmap_pages = 0;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1836,6 +1892,31 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 
+	/*
+	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
+	 * the same granularity it was added - a single memory block.
+	 */
+	if (memmap_on_memory_enabled) {
+		walk_memory_blocks(start, size, &nr_vmemmap_pages,
+					 get_nr_vmemmap_pages_cb);
+		if (nr_vmemmap_pages) {
+			if (size != memory_block_size_bytes()) {
+				pr_warn("Refuse to remove %#llx - %#llx,"
+					"wrong granularity\n",
+					 start, start + size);
+				return -EINVAL;
+			}
+
+			/*
+			 * Let remove_pmd_table->free_hugepage_table
+			 * do the right thing if we used vmem_altmap
+			 * when hot-adding the range.
+			 */
+			mhp_altmap.alloc = nr_vmemmap_pages;
+			altmap = &mhp_altmap;
+		}
+	}
+
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
@@ -1847,7 +1928,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, NULL);
+	arch_remove_memory(nid, start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ad3ed3ec4dd5..0d763b10b99b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8827,7 +8827,8 @@ void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone, must not contain holes,
  * must span full sections, and must be isolated before calling this function.
  */
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn,
+			      unsigned long buddy_start_pfn)
 {
 	unsigned long pfn = start_pfn;
 	struct page *page;
@@ -8838,6 +8839,7 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
+	pfn = buddy_start_pfn;
 	while (pfn < end_pfn) {
 		page = pfn_to_page(pfn);
 		/*
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 2/7] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
  2021-02-09 13:38 ` [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-09 13:38 ` [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 drivers/acpi/acpi_memhotplug.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..8cc195c4c861 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 	acpi_handle handle = mem_device->device->handle;
 	int result, num_enabled = 0;
 	struct acpi_memory_info *info;
+	mhp_t mhp_flags = MHP_NONE;
 	int node;
 
 	node = acpi_get_node(handle);
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
+		if (mhp_supports_memmap_on_memory(info->length))
+			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 		result = __add_memory(node, info->start_addr, info->length,
-				      MHP_NONE);
+				      mhp_flags);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
  2021-02-09 13:38 ` [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
  2021-02-09 13:38 ` [PATCH v2 2/7] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-25 18:25   ` David Hildenbrand
  2021-02-09 13:38 ` [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD Oscar Salvador
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Self stored memmap leads to a sparse memory situation which is unsuitable
for workloads that requires large contiguous memory chunks, so make this
an opt-in which needs to be explicitly enabled.

To control this, let memory_hotplug have its own memory space, as suggested
by David, so we can add memory_hotplug.memmap_on_memory parameter.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 Documentation/admin-guide/kernel-parameters.txt | 14 +++++++++++++
 mm/Makefile                                     |  5 ++++-
 mm/memory_hotplug.c                             | 26 ++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5adf1e57e932..696beb5f6c92 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2811,6 +2811,20 @@
 			seconds.  Use this parameter to check at some
 			other rate.  0 disables periodic checking.
 
+	memory_hotplug.memmap_on_memory
+			[KNL,X86,ARM] Boolean flag to enable this feature.
+			Format: {on | off (default)}
+			When enabled, memory to build the pages tables for the
+			memmap array describing the hot-added range will be taken
+			from the range itself, so the memmap page tables will be
+			self-hosted.
+			Since only single memory device ranges are supported at
+			the moment, this option is disabled by default because
+			it might have an impact on workloads that needs large
+			contiguous memory chunks.
+			The state of the flag can be read in
+			/sys/module/memory_hotplug/parameters/memmap_on_memory.
+
 	memtest=	[KNL,X86,ARM,PPC] Enable memtest
 			Format: <integer>
 			default : 0 <disable>
diff --git a/mm/Makefile b/mm/Makefile
index b2a564eec27f..9e284dba50ef 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
 
+# Give 'memory_hotplug' its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+
 obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
@@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)	+= kasan/
 obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9bcc460e8bfe..95695483a622 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,7 +42,31 @@
 #include "internal.h"
 #include "shuffle.h"
 
-static bool memmap_on_memory_enabled;
+/*
+ * memory_hotplug.memmap_on_memory parameter
+ */
+static bool memmap_on_memory_enabled __ro_after_init;
+
+static int memmap_on_memory_show(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%s\n",
+		       memmap_on_memory_enabled ? "on" : "off");
+}
+
+static __meminit int memmap_on_memory_store(const char *val,
+					    const struct kernel_param *kp)
+{
+	/*
+	 * Fail silently in case we cannot enable it due to platform constraints.
+	 * User can always check whether it is enabled or not via /sys/module.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
+		return 0;
+
+	return param_set_bool(val, kp);
+}
+module_param_call(memmap_on_memory, memmap_on_memory_store,
+		  memmap_on_memory_show, &memmap_on_memory_enabled, 0400);
 
 /*
  * online_page_callback contains pointer to current page onlining function.
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (2 preceding siblings ...)
  2021-02-09 13:38 ` [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-25 18:26   ` David Hildenbrand
  2021-02-09 13:38 ` [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory Oscar Salvador
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

When struct page's size is not multiple of PMD, these do not get
fully populated when adding sections, hence two sections will
intersect the same the PMD. This goes against the vmemmap-per-device
premise, so reject it if that is the case.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory_hotplug.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 95695483a622..d3fb036d33fd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -57,10 +57,11 @@ static __meminit int memmap_on_memory_store(const char *val,
 					    const struct kernel_param *kp)
 {
 	/*
-	 * Fail silently in case we cannot enable it due to platform constraints.
+	 * Fail silently in case we cannot enable it due to system constraints.
 	 * User can always check whether it is enabled or not via /sys/module.
 	 */
-	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
+	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
+	    (PMD_SIZE % sizeof(struct page)))
 		return 0;
 
 	return param_set_bool(val, kp);
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (3 preceding siblings ...)
  2021-02-09 13:38 ` [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-25 18:27   ` David Hildenbrand
  2021-02-09 13:38 ` [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Many places expects us to pass a pageblock aligned range.
E.g: memmap_init_zone() needs a pageblock aligned range in order
to set the proper migrate type for it.
online_pages() needs to operate on a pageblock aligned range for
isolation purposes.

Make sure we disable the feature in case we cannot guarantee the
right alignment.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory_hotplug.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d3fb036d33fd..1a4d5dd1a2c8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -56,12 +56,16 @@ static int memmap_on_memory_show(char *buffer, const struct kernel_param *kp)
 static __meminit int memmap_on_memory_store(const char *val,
 					    const struct kernel_param *kp)
 {
+	unsigned long pageblock_size = PFN_PHYS(pageblock_nr_pages);
+
 	/*
 	 * Fail silently in case we cannot enable it due to system constraints.
 	 * User can always check whether it is enabled or not via /sys/module.
 	 */
 	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
-	    (PMD_SIZE % sizeof(struct page)))
+	    (PMD_SIZE % sizeof(struct page)) ||
+	    !(MIN_MEMORY_BLOCK_SIZE - PMD_SIZE) ||
+	    !(MIN_MEMORY_BLOCK_SIZE - PMD_SIZE) % pageblock_size)
 		return 0;
 
 	return param_set_bool(val, kp);
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (4 preceding siblings ...)
  2021-02-09 13:38 ` [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-25 18:29   ` David Hildenbrand
  2021-02-09 13:38 ` [PATCH v2 7/7] arm64/Kconfig: " Oscar Salvador
  2021-02-17 10:18 ` [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 72663de8b04c..81046b7adb10 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2440,6 +2440,10 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 	depends on MEMORY_HOTPLUG
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+	def_bool y
+	depends on X86_64 && MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
+
 config USE_PERCPU_NUMA_NODE_ID
 	def_bool y
 	depends on NUMA
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 7/7] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (5 preceding siblings ...)
  2021-02-09 13:38 ` [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
@ 2021-02-09 13:38 ` Oscar Salvador
  2021-02-25 18:29   ` David Hildenbrand
  2021-02-17 10:18 ` [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-09 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual,
	Oscar Salvador

Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 87fd02a7a62f..d4fb29779cd4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -309,6 +309,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+	def_bool y
+	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
+
 config SMP
 	def_bool y
 
-- 
2.16.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/7] Allocate memmap from hotadded memory (per device)
  2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
                   ` (6 preceding siblings ...)
  2021-02-09 13:38 ` [PATCH v2 7/7] arm64/Kconfig: " Oscar Salvador
@ 2021-02-17 10:18 ` Oscar Salvador
  2021-02-22 11:15   ` Oscar Salvador
  7 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-17 10:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Tue, Feb 09, 2021 at 02:38:47PM +0100, Oscar Salvador wrote:
> Hi,
> 
> here is v2.
> 
> Changes from v1 -> v2
>  - Addressed feedback from David
>  - Fence off the feature in case struct page size is not
>    multiple of PMD size or pageblock alignment cannot be guaranted
>  - Tested on x86_64 small and large memory_blocks
>  - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
>    my VM with 16KB page size).
> 
>  Arm64 with 4KB page size behaves like x86_64 after [1], which made section
>  size smaller.
>  With 64KB, the feature gets fenced off due to pageblock alignment.
> 
> Changes from RFCv3 -> v1:
>  - Addressed feedback from David
>  - Re-order patches
> 
> Changes from v2 -> v3 (RFC):
>  - Re-order patches (Michal)
>  - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
>  - Add kernel boot option to enable this feature (Michal)
> 
> Changes from v1 -> v2 (RFC):
>  - Addressed feedback provided by David
>  - Add a arch_support_memmap_on_memory to be called
>    from mhp_supports_memmap_on_memory, as atm,
>    only ARM, powerpc and x86_64 have altmat support.
> 
> [1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraja@codeaurora.org/

Let me refloat this one :-)

> 
> Original cover letter:
> 
> ----
> 
> The primary goal of this patchset is to reduce memory overhead of the
> hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
> The current way we use to populate memmap (struct page array) has two main drawbacks:
> 
> a) it consumes an additional memory until the hotadded memory itself is
>    onlined and
> b) memmap might end up on a different numa node which is especially true
>    for movable_node configuration.
> c) due to fragmentation we might end up populating memmap with base
>    pages
> 
> One way to mitigate all these issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
> us to map any pfn range so the memory doesn't need to be online to be
> usable for the array. See patch 3 for more details.
> This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
> 
> [Overall design]:
> 
> Implementation wise we reuse vmem_altmap infrastructure to override
> the default allocator used by vmemap_populate.
> memory_block structure gained a new field called nr_vmemmap_pages.
> This plays well for two reasons:
> 
>  1) {offline/online}_pages know the difference between start_pfn and
>     buddy_start_pfn, which is start_pfn + nr_vmemmap_pages.
>     In this way all isolation/migration operations are
>     done to within the right range of memory without vmemmap pages.
>     This allows us for a much cleaner handling.
> 
>  2) In try_remove_memory, we construct a new vmemap_altmap struct with the
>     right information based on memory_block->nr_vmemap_pages, so we end up
>     calling vmem_altmap_free instead of free_pagetable when removing the memory.
> 
> Oscar Salvador (7):
>   mm,memory_hotplug: Allocate memmap from the added memory range
>   acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
>   mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
>   mm,memory_hotplug: Enforce struct page size to be multiple of PMD
>   mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory
>   x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>   arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> 
>  Documentation/admin-guide/kernel-parameters.txt |  14 +++
>  arch/arm64/Kconfig                              |   4 +
>  arch/x86/Kconfig                                |   4 +
>  drivers/acpi/acpi_memhotplug.c                  |   5 +-
>  drivers/base/memory.c                           |  20 ++--
>  include/linux/memory.h                          |   8 +-
>  include/linux/memory_hotplug.h                  |  21 +++-
>  include/linux/memremap.h                        |   2 +-
>  include/linux/mmzone.h                          |   5 +
>  mm/Kconfig                                      |   3 +
>  mm/Makefile                                     |   5 +-
>  mm/memory_hotplug.c                             | 150 ++++++++++++++++++++----
>  mm/page_alloc.c                                 |   4 +-
>  13 files changed, 208 insertions(+), 37 deletions(-)
> 
> -- 
> 2.16.3
> 
> 

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/7] Allocate memmap from hotadded memory (per device)
  2021-02-17 10:18 ` [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
@ 2021-02-22 11:15   ` Oscar Salvador
  2021-02-22 11:28     ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-22 11:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Wed, Feb 17, 2021 at 11:18:59AM +0100, Oscar Salvador wrote:
> On Tue, Feb 09, 2021 at 02:38:47PM +0100, Oscar Salvador wrote:
> > Hi,
> > 
> > here is v2.
> > 
> > Changes from v1 -> v2
> >  - Addressed feedback from David
> >  - Fence off the feature in case struct page size is not
> >    multiple of PMD size or pageblock alignment cannot be guaranted
> >  - Tested on x86_64 small and large memory_blocks
> >  - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
> >    my VM with 16KB page size).
> > 
> >  Arm64 with 4KB page size behaves like x86_64 after [1], which made section
> >  size smaller.
> >  With 64KB, the feature gets fenced off due to pageblock alignment.
> > 
> > Changes from RFCv3 -> v1:
> >  - Addressed feedback from David
> >  - Re-order patches
> > 
> > Changes from v2 -> v3 (RFC):
> >  - Re-order patches (Michal)
> >  - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
> >  - Add kernel boot option to enable this feature (Michal)
> > 
> > Changes from v1 -> v2 (RFC):
> >  - Addressed feedback provided by David
> >  - Add a arch_support_memmap_on_memory to be called
> >    from mhp_supports_memmap_on_memory, as atm,
> >    only ARM, powerpc and x86_64 have altmat support.
> > 
> > [1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraja@codeaurora.org/
> 
> Let me refloat this one :-)

Kindly ping :-)

 

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/7] Allocate memmap from hotadded memory (per device)
  2021-02-22 11:15   ` Oscar Salvador
@ 2021-02-22 11:28     ` David Hildenbrand
  2021-02-23  7:48       ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-02-22 11:28 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

On 22.02.21 12:15, Oscar Salvador wrote:
> On Wed, Feb 17, 2021 at 11:18:59AM +0100, Oscar Salvador wrote:
>> On Tue, Feb 09, 2021 at 02:38:47PM +0100, Oscar Salvador wrote:
>>> Hi,
>>>
>>> here is v2.
>>>
>>> Changes from v1 -> v2
>>>   - Addressed feedback from David
>>>   - Fence off the feature in case struct page size is not
>>>     multiple of PMD size or pageblock alignment cannot be guaranted
>>>   - Tested on x86_64 small and large memory_blocks
>>>   - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
>>>     my VM with 16KB page size).
>>>
>>>   Arm64 with 4KB page size behaves like x86_64 after [1], which made section
>>>   size smaller.
>>>   With 64KB, the feature gets fenced off due to pageblock alignment.
>>>
>>> Changes from RFCv3 -> v1:
>>>   - Addressed feedback from David
>>>   - Re-order patches
>>>
>>> Changes from v2 -> v3 (RFC):
>>>   - Re-order patches (Michal)
>>>   - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
>>>   - Add kernel boot option to enable this feature (Michal)
>>>
>>> Changes from v1 -> v2 (RFC):
>>>   - Addressed feedback provided by David
>>>   - Add a arch_support_memmap_on_memory to be called
>>>     from mhp_supports_memmap_on_memory, as atm,
>>>     only ARM, powerpc and x86_64 have altmat support.
>>>
>>> [1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraja@codeaurora.org/
>>
>> Let me refloat this one :-)
> 
> Kindly ping :-)

-EBUSY, will try having a look this week!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/7] Allocate memmap from hotadded memory (per device)
  2021-02-22 11:28     ` David Hildenbrand
@ 2021-02-23  7:48       ` Oscar Salvador
  0 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-23  7:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Mon, Feb 22, 2021 at 12:28:22PM +0100, David Hildenbrand wrote:
> -EBUSY, will try having a look this week!

sure, thanks for the effort David ;-)

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  2021-02-09 13:38 ` [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
@ 2021-02-25 18:25   ` David Hildenbrand
  2021-02-26 12:14     ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-02-25 18:25 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9bcc460e8bfe..95695483a622 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,7 +42,31 @@
>   #include "internal.h"
>   #include "shuffle.h"
>   
> -static bool memmap_on_memory_enabled;
> +/*
> + * memory_hotplug.memmap_on_memory parameter
> + */
> +static bool memmap_on_memory_enabled __ro_after_init;
> +
> +static int memmap_on_memory_show(char *buffer, const struct kernel_param *kp)
> +{
> +	return sprintf(buffer, "%s\n",
> +		       memmap_on_memory_enabled ? "on" : "off");
> +}
> +
> +static __meminit int memmap_on_memory_store(const char *val,
> +					    const struct kernel_param *kp)
> +{
> +	/*
> +	 * Fail silently in case we cannot enable it due to platform constraints.
> +	 * User can always check whether it is enabled or not via /sys/module.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
> +		return 0;
> +
> +	return param_set_bool(val, kp);
> +}
> +module_param_call(memmap_on_memory, memmap_on_memory_store,
> +		  memmap_on_memory_show, &memmap_on_memory_enabled, 0400);

Why are other users not allowed to read?

>   

Might want to add a MODULE_PARM_DESC().


As we fail silently already, I'd keep it simple and use a

module_param(memmap_on_memory, bool, 0444);

moving the IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) into the 
runtime check.


Apart from that LGTM.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD
  2021-02-09 13:38 ` [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD Oscar Salvador
@ 2021-02-25 18:26   ` David Hildenbrand
  2021-02-26 12:06     ` Oscar Salvador
  2021-03-01 10:06     ` Oscar Salvador
  0 siblings, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2021-02-25 18:26 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

On 09.02.21 14:38, Oscar Salvador wrote:
> When struct page's size is not multiple of PMD, these do not get
> fully populated when adding sections, hence two sections will
> intersect the same the PMD. This goes against the vmemmap-per-device
> premise, so reject it if that is the case.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/memory_hotplug.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 95695483a622..d3fb036d33fd 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -57,10 +57,11 @@ static __meminit int memmap_on_memory_store(const char *val,
>   					    const struct kernel_param *kp)
>   {
>   	/*
> -	 * Fail silently in case we cannot enable it due to platform constraints.
> +	 * Fail silently in case we cannot enable it due to system constraints.
>   	 * User can always check whether it is enabled or not via /sys/module.
>   	 */
> -	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
> +	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
> +	    (PMD_SIZE % sizeof(struct page)))
>   		return 0;
>   
>   	return param_set_bool(val, kp);
> 

Squash that into #1 - it's part of basic operation.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory
  2021-02-09 13:38 ` [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory Oscar Salvador
@ 2021-02-25 18:27   ` David Hildenbrand
  2021-02-26 12:06     ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-02-25 18:27 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

On 09.02.21 14:38, Oscar Salvador wrote:
> Many places expects us to pass a pageblock aligned range.
> E.g: memmap_init_zone() needs a pageblock aligned range in order
> to set the proper migrate type for it.
> online_pages() needs to operate on a pageblock aligned range for
> isolation purposes.
> 
> Make sure we disable the feature in case we cannot guarantee the
> right alignment.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/memory_hotplug.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d3fb036d33fd..1a4d5dd1a2c8 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -56,12 +56,16 @@ static int memmap_on_memory_show(char *buffer, const struct kernel_param *kp)
>   static __meminit int memmap_on_memory_store(const char *val,
>   					    const struct kernel_param *kp)
>   {
> +	unsigned long pageblock_size = PFN_PHYS(pageblock_nr_pages);
> +
>   	/*
>   	 * Fail silently in case we cannot enable it due to system constraints.
>   	 * User can always check whether it is enabled or not via /sys/module.
>   	 */
>   	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
> -	    (PMD_SIZE % sizeof(struct page)))
> +	    (PMD_SIZE % sizeof(struct page)) ||
> +	    !(MIN_MEMORY_BLOCK_SIZE - PMD_SIZE) ||
> +	    !(MIN_MEMORY_BLOCK_SIZE - PMD_SIZE) % pageblock_size)
>   		return 0;
>   
>   	return param_set_bool(val, kp);
> 

Dito, rather squash in #1 and add a comment explaining what's happening 
there.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-09 13:38 ` [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
@ 2021-02-25 18:29   ` David Hildenbrand
  2021-02-26 12:04     ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-02-25 18:29 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

On 09.02.21 14:38, Oscar Salvador wrote:
> Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   arch/x86/Kconfig | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 72663de8b04c..81046b7adb10 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2440,6 +2440,10 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
>   	def_bool y
>   	depends on MEMORY_HOTPLUG
>   
> +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> +	def_bool y
> +	depends on X86_64 && MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE

It depends on SPARSEMEM_VMEMMAP, no?

I think we could have

SPARSEMEM_VMEMMAP_ENABLE=y
and
SPARSEMEM_VMEMMAP=n

on manually tuned configs.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 7/7] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-09 13:38 ` [PATCH v2 7/7] arm64/Kconfig: " Oscar Salvador
@ 2021-02-25 18:29   ` David Hildenbrand
  2021-02-26 12:10     ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-02-25 18:29 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

On 09.02.21 14:38, Oscar Salvador wrote:
> Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   arch/arm64/Kconfig | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 87fd02a7a62f..d4fb29779cd4 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -309,6 +309,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
>   config ARCH_ENABLE_MEMORY_HOTREMOVE
>   	def_bool y
>   
> +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> +	def_bool y
> +	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
> +

Same comment as for x86-64 variant.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-02-09 13:38 ` [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
@ 2021-02-25 18:58   ` David Hildenbrand
  2021-02-28 18:50     ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-02-25 18:58 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Michal Hocko, VlastimilBabkavbabka, pasha.tatashin, linux-kernel,
	linux-mm, Anshuman Khandual

On 09.02.21 14:38, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>   a) an existing memory is consumed for that purpose
>      (eg: ~2MB per 128MB memory section on x86_64)
>   b) if the whole node is movable then we have off-node struct pages
>      which has performance drawbacks.
>   c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>      populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

nit: shouldn't it be

[start_pfn, buddy_start_pfn - 1]
[buddy_start_pfn, end_pfn - 1]

or

[start_pfn, buddy_start_pfn)
[buddy_start_pfn, end_pfn)

(I remember that "[" means inclusive and "(" means exclusive, I might be wrong :) )

I actually prefer the first variant.

> 
> Hot-remove:
> 
>   We need to be careful when removing memory, as adding and
>   removing memory needs to be done with the same granularity.
>   To check that this assumption is not violated, we check the
>   memory range we want to remove and if a) any memory block has
>   vmemmap pages and b) the range spans more than a single memory
>   block, we scream out loud and refuse to proceed.
> 
>   If all is good and the range was using memmap on memory (aka vmemmap pages),
>   we construct an altmap structure so free_hugepage_table does the right
>   thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

[...]

>   /*
>    * online_page_callback contains pointer to current page onlining function.
>    * Initially it is generic_online_page(). If it is required it could be
> @@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
>   }
>   EXPORT_SYMBOL_GPL(generic_online_page);
>   
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long buddy_start_pfn)
>   {
>   	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn;
> +	unsigned long pfn = buddy_start_pfn;
> +
> +	/*
> +	 * When using memmap_on_memory, the range might be unaligned as the
> +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> +	 */
> +	if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {

if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))

> +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> +		pfn += 1 << pageblock_order;

pfn += pageblock_nr_pages;

Can you add a comment why we can be sure that we are off by  a single pageblock? What about s390x where a MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?

Would it make thing simpler to just do a

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
	(*online_page_callback)(pfn_to_page(pfn), 0);
	pfn++;
}

[...]


>   
>   	/*
>   	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>   	return device_online(&mem->dev);
>   }
>   
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	return memmap_on_memory_enabled &&
> +	       size == memory_block_size_bytes();

Regarding my other comments as reply to the other patches, I'd move all magic you have when trying to enable right here.


>   
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			unsigned long nr_vmemmap_pages)
>   {
>   	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn, system_ram_pages = 0;
> +	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
>   	unsigned long flags;
>   	struct zone *zone;
>   	struct memory_notify arg;
> @@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
>   		return -EINVAL;
>   
> +	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> +	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
>   	mem_hotplug_begin();
>   
>   	/*
> @@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	zone_pcp_disable(zone);
>   
>   	/* set above range as isolated */
> -	ret = start_isolate_page_range(start_pfn, end_pfn,
> +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
>   				       MIGRATE_MOVABLE,
>   				       MEMORY_OFFLINE | REPORT_FAILURE);

Did you take care to properly adjust undo_isolate_page_range() as well? I can't spot it.


>   	if (ret) {
> @@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	}
>   
>   	do {
> -		pfn = start_pfn;
> +		pfn = buddy_start_pfn;
>   		do {
>   			if (signal_pending(current)) {
>   				ret = -EINTR;
> @@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   		 * offlining actually in order to make hugetlbfs's object
>   		 * counting consistent.
>   		 */
> -		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
> +		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
>   		if (ret) {
>   			reason = "failure to dissolve huge pages";
>   			goto failed_removal_isolated;
>   		}
>   
> -		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> +		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
>   
>   	} while (ret);
>   
>   	/* Mark all sections offline and remove free pages from the buddy. */
> -	__offline_isolated_pages(start_pfn, end_pfn);
> +	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);>   	pr_debug("Offlined Pages %ld\n", nr_pages);
>   
>   	/*
> @@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	 * of isolated pageblocks, memory onlining will properly revert this.
>   	 */
>   	spin_lock_irqsave(&zone->lock, flags);
> -	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
> +	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
>   	spin_unlock_irqrestore(&zone->lock, flags);
>   
>   	zone_pcp_enable(zone);
>   
>   	/* removal success */
> -	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> +	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
>   	zone->present_pages -= nr_pages;
>   
>   	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>   	return 0;
>   }
>   

[...]

> +static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> +{
> +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +
> +	*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> +	return mem->nr_vmemmap_pages;
> +}
> +

I think you can do this easier, all you want to know is if there
is any block that has nr_vmemmap_pages set - and return the value.

static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
{
	/* If not set, continue with the next block. */
	return mem->nr_vmemmap_pages;
}

...
> +		walk_memory_blocks(start, size, &nr_vmemmap_pages,
> +					 get_nr_vmemmap_pages_cb);

...

mem->nr_vmemmap_pages = walk_memory_blocks(start ...)



Looks quite promising, only a couple of things to fine-tune :) Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-25 18:29   ` David Hildenbrand
@ 2021-02-26 12:04     ` Oscar Salvador
  2021-03-01  8:45       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-26 12:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:29:07PM +0100, David Hildenbrand wrote:
> On 09.02.21 14:38, Oscar Salvador wrote:
> > Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > ---
> >   arch/x86/Kconfig | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 72663de8b04c..81046b7adb10 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2440,6 +2440,10 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
> >   	def_bool y
> >   	depends on MEMORY_HOTPLUG
> > +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> > +	def_bool y
> > +	depends on X86_64 && MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
> 
> It depends on SPARSEMEM_VMEMMAP, no?
> 
> I think we could have
> 
> SPARSEMEM_VMEMMAP_ENABLE=y
> and
> SPARSEMEM_VMEMMAP=n
> 
> on manually tuned configs.

I do not think this can happen:

from mm/Kconfig:

config SPARSEMEM_VMEMMAP_ENABLE
        bool

config SPARSEMEM_VMEMMAP
        bool "Sparse Memory virtual memmap"
        depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
        default y
        help
          SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
          pfn_to_page and page_to_pfn operations.  This is the most
          efficient option when sufficient kernel resources are available

and from arch/x86/Kconfig:

config ARCH_SPARSEMEM_ENABLE
        def_bool y
        depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
        select SPARSEMEM_STATIC if X86_32
        select SPARSEMEM_VMEMMAP_ENABLE if X86_64


So, if I read this correctly, for SPARSEMEM_VMEMMAP to be true,
SPARSEMEM_VMEMMAP_ENABLE needs to be true as well.

Am I missing something?

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory
  2021-02-25 18:27   ` David Hildenbrand
@ 2021-02-26 12:06     ` Oscar Salvador
  0 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-26 12:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:27:33PM +0100, David Hildenbrand wrote:
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index d3fb036d33fd..1a4d5dd1a2c8 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -56,12 +56,16 @@ static int memmap_on_memory_show(char *buffer, const struct kernel_param *kp)
> >   static __meminit int memmap_on_memory_store(const char *val,
> >   					    const struct kernel_param *kp)
> >   {
> > +	unsigned long pageblock_size = PFN_PHYS(pageblock_nr_pages);
> > +
> >   	/*
> >   	 * Fail silently in case we cannot enable it due to system constraints.
> >   	 * User can always check whether it is enabled or not via /sys/module.
> >   	 */
> >   	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
> > -	    (PMD_SIZE % sizeof(struct page)))
> > +	    (PMD_SIZE % sizeof(struct page)) ||
> > +	    !(MIN_MEMORY_BLOCK_SIZE - PMD_SIZE) ||
> > +	    !(MIN_MEMORY_BLOCK_SIZE - PMD_SIZE) % pageblock_size)
> >   		return 0;
> >   	return param_set_bool(val, kp);
> > 
> 
> Dito, rather squash in #1 and add a comment explaining what's happening
> there.

I was not sure about putting this and the PMD aligned patch as a
standalone patch, but I thought it might ease the review.
But I have no problem in placing them in patch#1 and put some more
detail into the changelog.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD
  2021-02-25 18:26   ` David Hildenbrand
@ 2021-02-26 12:06     ` Oscar Salvador
  2021-03-01 10:06     ` Oscar Salvador
  1 sibling, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-26 12:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:26:50PM +0100, David Hildenbrand wrote:
> Squash that into #1 - it's part of basic operation.

Ok, will do.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 7/7] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-25 18:29   ` David Hildenbrand
@ 2021-02-26 12:10     ` Oscar Salvador
  0 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-26 12:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:29:23PM +0100, David Hildenbrand wrote:
> On 09.02.21 14:38, Oscar Salvador wrote:
> > Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > ---
> >   arch/arm64/Kconfig | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 87fd02a7a62f..d4fb29779cd4 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -309,6 +309,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
> >   config ARCH_ENABLE_MEMORY_HOTREMOVE
> >   	def_bool y
> > +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> > +	def_bool y
> > +	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
> > +
> 
> Same comment as for x86-64 variant.

From arm64/Kconfig:

config ARCH_SPARSEMEM_ENABLE
        def_bool y
        select SPARSEMEM_VMEMMAP_ENABLE

config ARCH_SPARSEMEM_DEFAULT
        def_bool ARCH_SPARSEMEM_ENABLE

config ARCH_SELECT_MEMORY_MODEL
        def_bool ARCH_SPARSEMEM_ENABLE

config ARCH_FLATMEM_ENABLE
        def_bool !NUMA

It seems SPARSEMEM_VMEMMAP_ENABLE is ticked by default unless we are on
a !NUMA system. So make it dependent on SPARSEMEM_VMEMMAP_ENABLE seems
the right thing to do?


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  2021-02-25 18:25   ` David Hildenbrand
@ 2021-02-26 12:14     ` Oscar Salvador
  2021-03-01  8:27       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-02-26 12:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:25:58PM +0100, David Hildenbrand wrote:
> > +static __meminit int memmap_on_memory_store(const char *val,
> > +					    const struct kernel_param *kp)
> > +{
> > +	/*
> > +	 * Fail silently in case we cannot enable it due to platform constraints.
> > +	 * User can always check whether it is enabled or not via /sys/module.
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
> > +		return 0;
> > +
> > +	return param_set_bool(val, kp);
> > +}
> > +module_param_call(memmap_on_memory, memmap_on_memory_store,
> > +		  memmap_on_memory_show, &memmap_on_memory_enabled, 0400);
> 
> Why are other users not allowed to read?

I copied it from the shuffle one. I guess shuffle code made it 0400 for
security reasons.
Since we do not need that here, I will switch it to 0444.

> Might want to add a MODULE_PARM_DESC().

Sure

> As we fail silently already, I'd keep it simple and use a
> 
> module_param(memmap_on_memory, bool, 0444);
> 
> moving the IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) into the
> runtime check.

You mean moving this check into mhp_supports_memmap_on_memory() check
from patch#1 right?

Makes sense.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-02-25 18:58   ` David Hildenbrand
@ 2021-02-28 18:50     ` Oscar Salvador
  2021-03-01  8:30       ` David Hildenbrand
  2021-03-01  8:32       ` David Hildenbrand
  0 siblings, 2 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-02-28 18:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:58:01PM +0100, David Hildenbrand wrote:
> > In this way, we have:
> > 
> > (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> > (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy
> 
> nit: shouldn't it be
> 
> [start_pfn, buddy_start_pfn - 1]
> [buddy_start_pfn, end_pfn - 1]
> 
> or
> 
> [start_pfn, buddy_start_pfn)
> [buddy_start_pfn, end_pfn)
> 
> (I remember that "[" means inclusive and "(" means exclusive, I might be wrong :) )
> 
> I actually prefer the first variant.

Let us go witht the first variant, I guess it is more clear.

> > -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> > +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> > +			       unsigned long buddy_start_pfn)
> >   {
> >   	const unsigned long end_pfn = start_pfn + nr_pages;
> > -	unsigned long pfn;
> > +	unsigned long pfn = buddy_start_pfn;
> > +
> > +	/*
> > +	 * When using memmap_on_memory, the range might be unaligned as the
> > +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> > +	 */
> > +	if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {
> 
> if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))

Will change

> 
> > +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> > +		pfn += 1 << pageblock_order;
> 
> pfn += pageblock_nr_pages;
> 
> Can you add a comment why we can be sure that we are off by  a single pageblock? What about s390x where a MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?
> 
> Would it make thing simpler to just do a
> 
> while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
> 	(*online_page_callback)(pfn_to_page(pfn), 0);
> 	pfn++;
> }

Honestly, I did not spend much time thinking on other platforms other
than arm64/x86_64.
But I think that that would be the universal solution as we do not make
any assumptions.

I will replace it.

> > +bool mhp_supports_memmap_on_memory(unsigned long size)
> > +{
> > +	return memmap_on_memory_enabled &&
> > +	       size == memory_block_size_bytes();
> 
> Regarding my other comments as reply to the other patches, I'd move all magic you have when trying to enable right here.

Ok, will do.

> > @@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> >   	zone_pcp_disable(zone);
> >   	/* set above range as isolated */
> > -	ret = start_isolate_page_range(start_pfn, end_pfn,
> > +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
> >   				       MIGRATE_MOVABLE,
> >   				       MEMORY_OFFLINE | REPORT_FAILURE);
> 
> Did you take care to properly adjust undo_isolate_page_range() as well? I can't spot it.

No, I did not. Good that you noticed :-)

Will fix it up in the next version.

> > +static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> > +{
> > +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> > +
> > +	*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> > +	return mem->nr_vmemmap_pages;
> > +}
> > +
> 
> I think you can do this easier, all you want to know is if there
> is any block that has nr_vmemmap_pages set - and return the value.
> 
> static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> {
> 	/* If not set, continue with the next block. */
> 	return mem->nr_vmemmap_pages;
> }

Yeah, less code.
Will fix it.

> 
> ...
> > +		walk_memory_blocks(start, size, &nr_vmemmap_pages,
> > +					 get_nr_vmemmap_pages_cb);
> 
> ...
> 
> mem->nr_vmemmap_pages = walk_memory_blocks(start ...)
> 
> 
> 
> Looks quite promising, only a couple of things to fine-tune :) Thanks!

Thanks for having a look, that is highly appreciated!

Let us see if we can polish the minor things that are missing and target
this for the next release.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  2021-02-26 12:14     ` Oscar Salvador
@ 2021-03-01  8:27       ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2021-03-01  8:27 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

>> module_param(memmap_on_memory, bool, 0444);
>>
>> moving the IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) into the
>> runtime check.
> 
> You mean moving this check into mhp_supports_memmap_on_memory() check
> from patch#1 right?
> 
> Makes sense.

Yep.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-02-28 18:50     ` Oscar Salvador
@ 2021-03-01  8:30       ` David Hildenbrand
  2021-03-01  8:32       ` David Hildenbrand
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2021-03-01  8:30 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

>>> +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
>>> +		pfn += 1 << pageblock_order;
>>
>> pfn += pageblock_nr_pages;
>>
>> Can you add a comment why we can be sure that we are off by  a single pageblock? What about s390x where a MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?
>>
>> Would it make thing simpler to just do a
>>
>> while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
>> 	(*online_page_callback)(pfn_to_page(pfn), 0);
>> 	pfn++;
>> }
> 
> Honestly, I did not spend much time thinking on other platforms other
> than arm64/x86_64.
> But I think that that would be the universal solution as we do not make
> any assumptions.
> 
> I will replace it.

I think you could also go with

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
     (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
  	pageblock_nr_pages;
}

And maybe add a comment why we can be sure it has to be aligned to 
pageblocks. Or maybe rather add

VM_BUG_ON(IS_ALIGNED(pfn, pageblock_nr_pages));

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
  2021-02-28 18:50     ` Oscar Salvador
  2021-03-01  8:30       ` David Hildenbrand
@ 2021-03-01  8:32       ` David Hildenbrand
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual


>>> +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
>>> +		pfn += 1 << pageblock_order;
>>
>> pfn += pageblock_nr_pages;
>>
>> Can you add a comment why we can be sure that we are off by  a single pageblock? What about s390x where a MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?
>>
>> Would it make thing simpler to just do a
>>
>> while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
>> 	(*online_page_callback)(pfn_to_page(pfn), 0);
>> 	pfn++;
>> }
> 
> Honestly, I did not spend much time thinking on other platforms other
> than arm64/x86_64.
> But I think that that would be the universal solution as we do not make
> any assumptions.
> 
> I will replace it.


I think you can safely go with

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
	(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
	pfn += pageblock_nr_pages;
}

and maybe add before the loop

VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));

as help for the reader that this always holds.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-02-26 12:04     ` Oscar Salvador
@ 2021-03-01  8:45       ` David Hildenbrand
  2021-03-01 10:07         ` Oscar Salvador
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-03-01  8:45 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On 26.02.21 13:04, Oscar Salvador wrote:
> On Thu, Feb 25, 2021 at 07:29:07PM +0100, David Hildenbrand wrote:
>> On 09.02.21 14:38, Oscar Salvador wrote:
>>> Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
>>>
>>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>>> ---
>>>    arch/x86/Kconfig | 4 ++++
>>>    1 file changed, 4 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 72663de8b04c..81046b7adb10 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -2440,6 +2440,10 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
>>>    	def_bool y
>>>    	depends on MEMORY_HOTPLUG
>>> +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>>> +	def_bool y
>>> +	depends on X86_64 && MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP_ENABLE
>>
>> It depends on SPARSEMEM_VMEMMAP, no?
>>
>> I think we could have
>>
>> SPARSEMEM_VMEMMAP_ENABLE=y
>> and
>> SPARSEMEM_VMEMMAP=n
>>
>> on manually tuned configs.
> 
> I do not think this can happen:
> 
> from mm/Kconfig:
> 
> config SPARSEMEM_VMEMMAP_ENABLE
>          bool
> 
> config SPARSEMEM_VMEMMAP
>          bool "Sparse Memory virtual memmap"
>          depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
>          default y
>          help
>            SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
>            pfn_to_page and page_to_pfn operations.  This is the most
>            efficient option when sufficient kernel resources are available
> 
> and from arch/x86/Kconfig:
> 
> config ARCH_SPARSEMEM_ENABLE
>          def_bool y
>          depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
>          select SPARSEMEM_STATIC if X86_32
>          select SPARSEMEM_VMEMMAP_ENABLE if X86_64
> 
> 
> So, if I read this correctly, for SPARSEMEM_VMEMMAP to be true,
> SPARSEMEM_VMEMMAP_ENABLE needs to be true as well.
> 
> Am I missing something?

Take your config and set
	X86_5LEVEL=n
(because it enforces SPARSEMEM_VMEMMAP)
and
	SPARSEMEM_VMEMMAP=n

When compiling, you'll end up with a config like
	CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
	# CONFIG_SPARSEMEM_VMEMMAP is not set

Yet, with your patch you would get

ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y

And it would not get fenced off in the code, right?


I think you either have to check (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) 
in addition in your code, or enforce it differently. Like


config MHP_MEMMAP_ON_MEMORY
	depends on SPARSEMEM_VMEMMAP && ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
	bool


Then you can simplify the arch Kconfig settings, removing the sparesemem 
dependency there.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD
  2021-02-25 18:26   ` David Hildenbrand
  2021-02-26 12:06     ` Oscar Salvador
@ 2021-03-01 10:06     ` Oscar Salvador
  2021-03-01 10:24       ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: Oscar Salvador @ 2021-03-01 10:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Thu, Feb 25, 2021 at 07:26:50PM +0100, David Hildenbrand wrote:
> On 09.02.21 14:38, Oscar Salvador wrote:
> > When struct page's size is not multiple of PMD, these do not get
> > fully populated when adding sections, hence two sections will
> > intersect the same the PMD. This goes against the vmemmap-per-device
> > premise, so reject it if that is the case.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > ---
> >   mm/memory_hotplug.c | 5 +++--
> >   1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 95695483a622..d3fb036d33fd 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -57,10 +57,11 @@ static __meminit int memmap_on_memory_store(const char *val,
> >   					    const struct kernel_param *kp)
> >   {
> >   	/*
> > -	 * Fail silently in case we cannot enable it due to platform constraints.
> > +	 * Fail silently in case we cannot enable it due to system constraints.
> >   	 * User can always check whether it is enabled or not via /sys/module.
> >   	 */
> > -	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
> > +	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
> > +	    (PMD_SIZE % sizeof(struct page)))
> >   		return 0;
> >   	return param_set_bool(val, kp);
> > 
> 
> Squash that into #1 - it's part of basic operation.

Just to be sure we are on the same page.
You mean to move this patch and patch#5 into the runtime check in
mhp_supports_memmap_on_memory(), right?


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  2021-03-01  8:45       ` David Hildenbrand
@ 2021-03-01 10:07         ` Oscar Salvador
  0 siblings, 0 replies; 31+ messages in thread
From: Oscar Salvador @ 2021-03-01 10:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On Mon, Mar 01, 2021 at 09:45:07AM +0100, David Hildenbrand wrote:
> Take your config and set
> 	X86_5LEVEL=n
> (because it enforces SPARSEMEM_VMEMMAP)
> and
> 	SPARSEMEM_VMEMMAP=n
> 
> When compiling, you'll end up with a config like
> 	CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
> 	# CONFIG_SPARSEMEM_VMEMMAP is not set

I see

> 
> Yet, with your patch you would get
> 
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
> 
> And it would not get fenced off in the code, right?
> 
> 
> I think you either have to check (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) in
> addition in your code, or enforce it differently. Like
> 
> 
> config MHP_MEMMAP_ON_MEMORY
> 	depends on SPARSEMEM_VMEMMAP && ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> 	bool
> 
> 
> Then you can simplify the arch Kconfig settings, removing the sparesemem
> dependency there.

Yes, will do it this way.

Thanks!

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD
  2021-03-01 10:06     ` Oscar Salvador
@ 2021-03-01 10:24       ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2021-03-01 10:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Michal Hocko, VlastimilBabkavbabka,
	pasha.tatashin, linux-kernel, linux-mm, Anshuman Khandual

On 01.03.21 11:06, Oscar Salvador wrote:
> On Thu, Feb 25, 2021 at 07:26:50PM +0100, David Hildenbrand wrote:
>> On 09.02.21 14:38, Oscar Salvador wrote:
>>> When struct page's size is not multiple of PMD, these do not get
>>> fully populated when adding sections, hence two sections will
>>> intersect the same the PMD. This goes against the vmemmap-per-device
>>> premise, so reject it if that is the case.
>>>
>>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>>> ---
>>>    mm/memory_hotplug.c | 5 +++--
>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 95695483a622..d3fb036d33fd 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -57,10 +57,11 @@ static __meminit int memmap_on_memory_store(const char *val,
>>>    					    const struct kernel_param *kp)
>>>    {
>>>    	/*
>>> -	 * Fail silently in case we cannot enable it due to platform constraints.
>>> +	 * Fail silently in case we cannot enable it due to system constraints.
>>>    	 * User can always check whether it is enabled or not via /sys/module.
>>>    	 */
>>> -	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE))
>>> +	if (!IS_ENABLED(CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE) ||
>>> +	    (PMD_SIZE % sizeof(struct page)))
>>>    		return 0;
>>>    	return param_set_bool(val, kp);
>>>
>>
>> Squash that into #1 - it's part of basic operation.
> 
> Just to be sure we are on the same page.
> You mean to move this patch and patch#5 into the runtime check in
> mhp_supports_memmap_on_memory(), right?

Yes, then the compiler might even be able to figure out that what 
follows the check is dead code.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-03-01 10:25 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-09 13:38 [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
2021-02-09 13:38 ` [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
2021-02-25 18:58   ` David Hildenbrand
2021-02-28 18:50     ` Oscar Salvador
2021-03-01  8:30       ` David Hildenbrand
2021-03-01  8:32       ` David Hildenbrand
2021-02-09 13:38 ` [PATCH v2 2/7] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported Oscar Salvador
2021-02-09 13:38 ` [PATCH v2 3/7] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory Oscar Salvador
2021-02-25 18:25   ` David Hildenbrand
2021-02-26 12:14     ` Oscar Salvador
2021-03-01  8:27       ` David Hildenbrand
2021-02-09 13:38 ` [PATCH v2 4/7] mm,memory_hotplug: Enforce struct page size to be multiple of PMD Oscar Salvador
2021-02-25 18:26   ` David Hildenbrand
2021-02-26 12:06     ` Oscar Salvador
2021-03-01 10:06     ` Oscar Salvador
2021-03-01 10:24       ` David Hildenbrand
2021-02-09 13:38 ` [PATCH v2 5/7] mm,memory_hotplug: Enforce pageblock alignment when memmap_on_memory Oscar Salvador
2021-02-25 18:27   ` David Hildenbrand
2021-02-26 12:06     ` Oscar Salvador
2021-02-09 13:38 ` [PATCH v2 6/7] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Oscar Salvador
2021-02-25 18:29   ` David Hildenbrand
2021-02-26 12:04     ` Oscar Salvador
2021-03-01  8:45       ` David Hildenbrand
2021-03-01 10:07         ` Oscar Salvador
2021-02-09 13:38 ` [PATCH v2 7/7] arm64/Kconfig: " Oscar Salvador
2021-02-25 18:29   ` David Hildenbrand
2021-02-26 12:10     ` Oscar Salvador
2021-02-17 10:18 ` [PATCH v2 0/7] Allocate memmap from hotadded memory (per device) Oscar Salvador
2021-02-22 11:15   ` Oscar Salvador
2021-02-22 11:28     ` David Hildenbrand
2021-02-23  7:48       ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).