mm-commits.vger.kernel.org archive mirror
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, anshuman.khandual@arm.com,
	david@redhat.com, linux-mm@kvack.org, mhocko@suse.com,
	mm-commits@vger.kernel.org, osalvador@suse.de,
	pasha.tatashin@soleen.com, torvalds@linux-foundation.org,
	vbabka@suse.cz
Subject: [patch 128/143] mm,memory_hotplug: allocate memmap from the added memory range
Date: Tue, 04 May 2021 18:39:42 -0700
Message-ID: <20210505013942.NUhfLGPRr%akpm@linux-foundation.org>
In-Reply-To: <20210504183219.a3cc46aee4013d77402276c5@linux-foundation.org>

From: Oscar Salvador <osalvador@suse.de>
Subject: mm,memory_hotplug: allocate memmap from the added memory range

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section.  Currently, alloc_pages_node() is used for
those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
    (e.g. ~2MB per 128MB memory section on x86_64).
    This can even lead to extreme cases where the system goes OOM because
    the physically hotplugged memory depletes the available memory before
    it is onlined.
 b) if the whole node is movable then we have off-node struct pages,
    which have performance drawbacks.
 c) there might be no PMD_ALIGNED chunks, so the memmap array gets
    populated with base pages.
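
The ~2MB figure in (a) can be checked with a small userspace sketch.  It
is not kernel code: memmap_bytes() is a hypothetical helper, and the 4096
byte page size and 64 byte struct page are assumed x86_64 values rather
than something this patch defines.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of memmap (struct page array) needed to describe section_bytes
 * of memory.  page_size and struct_page_size are parameters because both
 * depend on the architecture and kernel configuration.
 */
static uint64_t memmap_bytes(uint64_t section_bytes, uint64_t page_size,
			     uint64_t struct_page_size)
{
	/* A section is always a whole number of pages. */
	assert(page_size != 0 && section_bytes % page_size == 0);

	return (section_bytes / page_size) * struct_page_size;
}
```

With the assumed values, memmap_bytes(128ULL << 20, 4096, 64) yields
2097152 bytes, i.e. 2MB for a 128MB section.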

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.  That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables.  This implementation uses the beginning of the hotplugged memory
for that purpose.
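
A rough userspace model of the resulting layout is sketched below.  The
struct and function names are hypothetical, and the page and struct page
sizes are passed in as assumptions; the kernel derives the number of
vmemmap pages from the altmap instead of computing it this way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one hot-added memory block. */
struct block_layout {
	uint64_t nr_pages;		/* pages in the whole block */
	uint64_t nr_vmemmap_pages;	/* pages at the start holding the memmap */
	uint64_t first_usable_pfn;	/* first pfn handed to the page allocator */
	uint64_t nr_usable_pages;	/* pages left after the memmap */
};

static struct block_layout split_block(uint64_t start_pfn, uint64_t block_bytes,
				       uint64_t page_size,
				       uint64_t struct_page_size)
{
	struct block_layout l;
	uint64_t memmap_bytes;

	assert(page_size != 0 && block_bytes % page_size == 0);

	l.nr_pages = block_bytes / page_size;
	memmap_bytes = l.nr_pages * struct_page_size;
	/* Round up: a partially used page is still consumed. */
	l.nr_vmemmap_pages = (memmap_bytes + page_size - 1) / page_size;
	l.first_usable_pfn = start_pfn + l.nr_vmemmap_pages;
	l.nr_usable_pages = l.nr_pages - l.nr_vmemmap_pages;
	return l;
}
```

For a 128MB block with 4096 byte pages and a 64 byte struct page, the
first 512 pages hold the memmap and the remaining 32256 pages are what
gets onlined.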

There are some non-obvious things to consider though.  Vmemmap pages are
allocated/freed during the memory hotplug events (add_memory_resource(),
try_remove_memory()) when the memory is added/removed.  This means that
the reserved physical range is not online although it is used.  The most
obvious side effect is that pfn_to_online_page() returns NULL for those
pfns.  The current design expects that this should be OK as the hotplugged
memory is considered garbage until it is onlined.  For example,
hibernation wouldn't save the content of those vmemmaps into the image, so
it wouldn't be restored on resume, but this should be OK as there is no real
content to recover anyway while metadata is reachable from other data
structures (e.g.  vmemmap page tables).

The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
allocator independent initialization from the regular onlining path.  The
primary reason to handle the reserved space outside of {on,off}line_pages
is to make each initialization specific to the purpose rather than special
case them in a single function.

As per above, the functions that are introduced are:

 - mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
   kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
   fully span.

 - mhp_deinit_memmap_on_memory:
   Offlines as many sections as vmemmap pages fully span, removes the
   range from the zone by remove_pfn_range_from_zone(), and calls
   kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages().  Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
present_pages is done at the end once we know that online_pages()
succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages before calling offline_pages().  This is necessary because
offline_pages() tears down some structures depending on whether the
node or the zone becomes empty.  If offline_pages() fails, we account back
vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().
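
The ordering described in the two paragraphs above can be modelled in
userspace as follows.  This is a toy sketch only: every toy_* name is
hypothetical, the boolean flag stands in for mhp_{de}init_memmap_on_memory(),
and the fail parameter merely simulates online_pages()/offline_pages()
returning an error.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_zone { long present_pages; };

struct toy_block {
	long nr_pages;
	long nr_vmemmap_pages;
	bool vmemmap_initialized;	/* models mhp_{de}init_memmap_on_memory() */
};

/* Stand-ins for online_pages()/offline_pages(); they can be told to fail. */
static int toy_online_pages(struct toy_zone *z, long nr, bool fail)
{
	if (fail)
		return -1;
	z->present_pages += nr;
	return 0;
}

static int toy_offline_pages(struct toy_zone *z, long nr, bool fail)
{
	if (fail)
		return -1;
	z->present_pages -= nr;
	return 0;
}

static int toy_block_online(struct toy_block *b, struct toy_zone *z, bool fail)
{
	int ret;

	assert(b->nr_vmemmap_pages <= b->nr_pages);
	if (b->nr_vmemmap_pages)
		b->vmemmap_initialized = true;		/* init first */

	ret = toy_online_pages(z, b->nr_pages - b->nr_vmemmap_pages, fail);
	if (ret) {
		b->vmemmap_initialized = false;		/* clean up on failure */
		return ret;
	}
	z->present_pages += b->nr_vmemmap_pages;	/* account only on success */
	return 0;
}

static int toy_block_offline(struct toy_block *b, struct toy_zone *z, bool fail)
{
	int ret;

	z->present_pages -= b->nr_vmemmap_pages;	/* unaccount first */
	ret = toy_offline_pages(z, b->nr_pages - b->nr_vmemmap_pages, fail);
	if (ret) {
		z->present_pages += b->nr_vmemmap_pages; /* account back */
		return ret;
	}
	b->vmemmap_initialized = false;			/* deinit on success */
	return 0;
}
```

In this model a failed online leaves present_pages untouched and the
vmemmap deinitialized, and a failed offline leaves the block exactly as
it was before the attempt.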

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.
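
 The granularity check can be sketched in userspace like this.  The
 struct and function names are hypothetical; the kernel walks the real
 memory blocks with walk_memory_blocks() and compares the size of the
 range against memory_block_size_bytes() rather than counting array
 entries.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct memory_block. */
struct rm_block { uint64_t nr_vmemmap_pages; };

/*
 * Returns 0 if the range may be removed and stores the number of vmemmap
 * pages found (0 if the range does not use memmap on memory).  Returns
 * -EINVAL if any block has vmemmap pages and the range spans more than
 * one block.
 */
static int check_remove_range(const struct rm_block *blocks, size_t nr_blocks,
			      uint64_t *nr_vmemmap_pages)
{
	size_t i;

	assert(blocks != NULL && nr_vmemmap_pages != NULL);

	*nr_vmemmap_pages = 0;
	for (i = 0; i < nr_blocks; i++) {
		if (blocks[i].nr_vmemmap_pages) {
			/* Stop at the first block that carries a memmap. */
			*nr_vmemmap_pages = blocks[i].nr_vmemmap_pages;
			break;
		}
	}

	if (*nr_vmemmap_pages && nr_blocks != 1)
		return -EINVAL;	/* wrong granularity, refuse */
	return 0;
}
```

 A nonzero count returned with success is what would then be used to
 fill in the altmap before tearing down the page tables.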

Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |   72 ++++++++++++-
 include/linux/memory.h         |    8 +
 include/linux/memory_hotplug.h |   15 ++
 include/linux/memremap.h       |    2 
 include/linux/mmzone.h         |    7 -
 mm/Kconfig                     |    5 
 mm/memory_hotplug.c            |  161 +++++++++++++++++++++++++++++--
 mm/sparse.c                    |    2 
 8 files changed, 250 insertions(+), 22 deletions(-)

--- a/drivers/base/memory.c~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/drivers/base/memory.c
@@ -173,16 +173,73 @@ static int memory_block_online(struct me
 {
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	struct zone *zone;
+	int ret;
+
+	zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+
+	/*
+	 * Although vmemmap pages have a different lifecycle than the pages
+	 * they describe (they remain until the memory is unplugged), doing
+	 * their initialization and accounting at memory onlining/offlining
+	 * stage helps to keep accounting easier to follow - e.g vmemmaps
+	 * belong to the same zone as the memory they backed.
+	 */
+	if (nr_vmemmap_pages) {
+		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+		if (ret)
+			return ret;
+	}
+
+	ret = online_pages(start_pfn + nr_vmemmap_pages,
+			   nr_pages - nr_vmemmap_pages, zone);
+	if (ret) {
+		if (nr_vmemmap_pages)
+			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+		return ret;
+	}
+
+	/*
+	 * Account once onlining succeeded. If the zone was unpopulated, it is
+	 * now already properly populated.
+	 */
+	if (nr_vmemmap_pages)
+		adjust_present_page_count(zone, nr_vmemmap_pages);
 
-	return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+	return ret;
 }
 
 static int memory_block_offline(struct memory_block *mem)
 {
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+
+	/*
+	 * Unaccount before offlining, such that unpopulated zone and kthreads
+	 * can properly be torn down in offline_pages().
+	 */
+	if (nr_vmemmap_pages)
+		adjust_present_page_count(zone, -nr_vmemmap_pages);
 
-	return offline_pages(start_pfn, nr_pages);
+	ret = offline_pages(start_pfn + nr_vmemmap_pages,
+			    nr_pages - nr_vmemmap_pages);
+	if (ret) {
+		/* offline_pages() failed. Account back. */
+		if (nr_vmemmap_pages)
+			adjust_present_page_count(zone, nr_vmemmap_pages);
+		return ret;
+	}
+
+	if (nr_vmemmap_pages)
+		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+
+	return ret;
 }
 
 /*
@@ -576,7 +633,8 @@ int register_memory(struct memory_block
 	return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+			     unsigned long nr_vmemmap_pages)
 {
 	struct memory_block *mem;
 	int ret = 0;
@@ -593,6 +651,7 @@ static int init_memory_block(unsigned lo
 	mem->start_section_nr = block_id * sections_per_block;
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
+	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
 	ret = register_memory(mem);
 
@@ -612,7 +671,7 @@ static int add_memory_block(unsigned lon
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -634,7 +693,8 @@ static void unregister_memory(struct mem
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -647,7 +707,7 @@ int create_memory_block_devices(unsigned
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
 		if (ret)
 			break;
 	}
--- a/include/linux/memory.h~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/include/linux/memory.h
@@ -29,6 +29,11 @@ struct memory_block {
 	int online_type;		/* for passing data to online routine */
 	int nid;			/* NID for this memory block */
 	struct device dev;
+	/*
+	 * Number of vmemmap pages. These pages
+	 * lay at the beginning of the memory block.
+	 */
+	unsigned long nr_vmemmap_pages;
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -80,7 +85,8 @@ static inline int memory_notify(unsigned
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
--- a/include/linux/memory_hotplug.h~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/include/linux/memory_hotplug.h
@@ -56,6 +56,14 @@ typedef int __bitwise mhp_t;
 #define MHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
 /*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+
+/*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
  * pgprot: page protection flags to apply to newly created page tables
@@ -99,9 +107,13 @@ static inline void zone_seqlock_init(str
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
+extern void adjust_present_page_count(struct zone *zone, long nr_pages);
 /* VM interface that may be used by firmware interface */
+extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+				     struct zone *zone);
+extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			int online_type, int nid);
+			struct zone *zone);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -359,6 +371,7 @@ extern struct zone *zone_for_pfn_range(i
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
--- a/include/linux/memremap.h~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/include/linux/memremap.h
@@ -17,7 +17,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
--- a/include/linux/mmzone.h~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/include/linux/mmzone.h
@@ -436,6 +436,11 @@ enum zone_type {
 	 *    situations where ZERO_PAGE(0) which is allocated differently
 	 *    on different platforms may end up in a movable zone. ZERO_PAGE(0)
 	 *    cannot be migrated.
+	 * 7. Memory-hotplug: when using memmap_on_memory and onlining the
+	 *    memory to the MOVABLE zone, the vmemmap pages are also placed in
+	 *    such zone. Such pages cannot be really moved around as they are
+	 *    self-stored in the range, but they are treated as movable when
+	 *    the range they describe is about to be offlined.
 	 *
 	 * In general, no unmovable allocations that degrade memory offlining
 	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
@@ -1392,10 +1397,8 @@ static inline int online_section_nr(unsi
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
 #endif
-#endif
 
 static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 {
--- a/mm/Kconfig~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/mm/Kconfig
@@ -188,6 +188,11 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+config MHP_MEMMAP_ON_MEMORY
+	def_bool y
+	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
+	depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
--- a/mm/memory_hotplug.c~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/mm/memory_hotplug.c
@@ -42,6 +42,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
+static bool memmap_on_memory;
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -648,9 +650,16 @@ static void online_pages_range(unsigned
 	 * decide to not expose all pages to the buddy (e.g., expose them
 	 * later). We account all pages as being online and belonging to this
 	 * zone ("present").
+	 * When using memmap_on_memory, the range might not be aligned to
+	 * MAX_ORDER_NR_PAGES - 1, but pageblock aligned. __ffs() will detect
+	 * this and the first chunk to online will be pageblock_nr_pages.
 	 */
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
-		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
+	for (pfn = start_pfn; pfn < end_pfn;) {
+		int order = min(MAX_ORDER - 1UL, __ffs(pfn));
+
+		(*online_page_callback)(pfn_to_page(pfn), order);
+		pfn += (1UL << order);
+	}
 
 	/* mark all involved sections as online */
 	online_mem_sections(start_pfn, end_pfn);
@@ -829,7 +838,11 @@ struct zone * zone_for_pfn_range(int onl
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
-static void adjust_present_page_count(struct zone *zone, long nr_pages)
+/*
+ * This function should only be called by memory_block_{online,offline},
+ * and {online,offline}_pages.
+ */
+void adjust_present_page_count(struct zone *zone, long nr_pages)
 {
 	unsigned long flags;
 
@@ -839,12 +852,54 @@ static void adjust_present_page_count(st
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       int online_type, int nid)
+int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+			      struct zone *zone)
+{
+	unsigned long end_pfn = pfn + nr_pages;
+	int ret;
+
+	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
+	if (ret)
+		return ret;
+
+	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
+
+	/*
+	 * It might be that the vmemmap_pages fully span sections. If that is
+	 * the case, mark those sections online here as otherwise they will be
+	 * left offline.
+	 */
+	if (nr_pages >= PAGES_PER_SECTION)
+		online_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION));
+
+	return ret;
+}
+
+void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
+{
+	unsigned long end_pfn = pfn + nr_pages;
+
+	/*
+	 * It might be that the vmemmap_pages fully span sections. If that is
+	 * the case, mark those sections offline here as otherwise they will be
+	 * left online.
+	 */
+	if (nr_pages >= PAGES_PER_SECTION)
+		offline_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION));
+
+	/*
+	 * The pages associated with this vmemmap have been offlined, so
+	 * we can reset its state here.
+	 */
+	remove_pfn_range_from_zone(page_zone(pfn_to_page(pfn)), pfn, nr_pages);
+	kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
+}
+
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
 {
 	unsigned long flags;
-	struct zone *zone;
 	int need_zonelists_rebuild = 0;
+	const int nid = zone_to_nid(zone);
 	int ret;
 	struct memory_notify arg;
 
@@ -863,7 +918,6 @@ int __ref online_pages(unsigned long pfn
 	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
-	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
@@ -1077,6 +1131,45 @@ static int online_memory_block(struct me
 	return device_online(&mem->dev);
 }
 
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
+	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
+	unsigned long remaining_size = size - vmemmap_size;
+
+	/*
+	 * Besides having arch support and the feature enabled at runtime, we
+	 * need a few more assumptions to hold true:
+	 *
+	 * a) We span a single memory block: memory onlining/offlining happens
+	 *    in memory block granularity. We don't want the vmemmap of online
+	 *    memory blocks to reside on offline memory blocks. In the future,
+	 *    we might want to support variable-sized memory blocks to make the
+	 *    feature more versatile.
+	 *
+	 * b) The vmemmap pages span complete PMDs: We don't want vmemmap code
+	 *    to populate memory from the altmap for unrelated parts (i.e.,
+	 *    other memory blocks)
+	 *
+	 * c) The vmemmap pages (and thereby the pages that will be exposed to
+	 *    the buddy) have to cover full pageblocks: memory onlining/offlining
+	 *    code requires applicable ranges to be page-aligned, for example, to
+	 *    set the migratetypes properly.
+	 *
+	 * TODO: Although we have a check here to make sure that vmemmap pages
+	 *       fully populate a PMD, it is not the right place to check for
+	 *       this. A much better solution involves improving vmemmap code
+	 *       to fallback to base pages when trying to populate vmemmap using
+	 *       altmap as an alternative source of memory, and we do not exactly
+	 *       populate a single PMD.
+	 */
+	return memmap_on_memory &&
+	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
+	       size == memory_block_size_bytes() &&
+	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1086,6 +1179,7 @@ static int online_memory_block(struct me
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	struct vmem_altmap mhp_altmap = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
@@ -1112,13 +1206,26 @@ int __ref add_memory_resource(int nid, s
 		goto error;
 	new_node = ret;
 
+	/*
+	 * Self hosted memmap array
+	 */
+	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
+		if (!mhp_supports_memmap_on_memory(size)) {
+			ret = -EINVAL;
+			goto error;
+		}
+		mhp_altmap.free = PHYS_PFN(size);
+		mhp_altmap.base_pfn = PHYS_PFN(start);
+		params.altmap = &mhp_altmap;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &params);
 	if (ret < 0)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
 	if (ret) {
 		arch_remove_memory(nid, start, size, NULL);
 		goto error;
@@ -1767,6 +1874,14 @@ static int check_memblock_offlined_cb(st
 	return 0;
 }
 
+static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+	/*
+	 * If not set, continue with the next block.
+	 */
+	return mem->nr_vmemmap_pages;
+}
+
 static int check_cpu_on_node(pg_data_t *pgdat)
 {
 	int cpu;
@@ -1841,6 +1956,9 @@ EXPORT_SYMBOL(try_offline_node);
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
 	int rc = 0;
+	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap *altmap = NULL;
+	unsigned long nr_vmemmap_pages;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1853,6 +1971,31 @@ static int __ref try_remove_memory(int n
 	if (rc)
 		return rc;
 
+	/*
+	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
+	 * the same granularity it was added - a single memory block.
+	 */
+	if (memmap_on_memory) {
+		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
+						      get_nr_vmemmap_pages_cb);
+		if (nr_vmemmap_pages) {
+			if (size != memory_block_size_bytes()) {
+				pr_warn("Refuse to remove %#llx - %#llx,"
+					" wrong granularity\n",
+					start, start + size);
+				return -EINVAL;
+			}
+
+			/*
+			 * Let remove_pmd_table->free_hugepage_table do the
+			 * right thing if we used vmem_altmap when hot-adding
+			 * the range.
+			 */
+			mhp_altmap.alloc = nr_vmemmap_pages;
+			altmap = &mhp_altmap;
+		}
+	}
+
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
@@ -1864,7 +2007,7 @@ static int __ref try_remove_memory(int n
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, NULL);
+	arch_remove_memory(nid, start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
--- a/mm/sparse.c~mmmemory_hotplug-allocate-memmap-from-the-added-memory-range
+++ a/mm/sparse.c
@@ -624,7 +624,6 @@ void online_mem_sections(unsigned long s
 	}
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 /* Mark all memory sections within the pfn range as offline */
 void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 {
@@ -645,7 +644,6 @@ void offline_mem_sections(unsigned long
 		ms->section_mem_map &= ~SECTION_IS_ONLINE;
 	}
 }
-#endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
_

Thread overview: 146+ messages
2021-05-05  1:32 incoming Andrew Morton
2021-05-05  1:32 ` [patch 001/143] mm: introduce and use mapping_empty() Andrew Morton
2021-05-05  1:32 ` [patch 002/143] mm: stop accounting shadow entries Andrew Morton
2021-05-05  1:32 ` [patch 003/143] dax: account DAX entries as nrpages Andrew Morton
2021-05-05  1:32 ` [patch 004/143] mm: remove nrexceptional from inode Andrew Morton
2021-05-05  1:32 ` [patch 005/143] mm: remove nrexceptional from inode: remove BUG_ON Andrew Morton
2021-05-05  1:33 ` [patch 006/143] hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share() Andrew Morton
2021-05-05  1:33 ` [patch 007/143] hugetlb/userfaultfd: forbid huge pmd sharing when uffd enabled Andrew Morton
2021-05-05  1:33 ` [patch 008/143] mm/hugetlb: move flush_hugetlb_tlb_range() into hugetlb.h Andrew Morton
2021-05-05  1:33 ` [patch 009/143] hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp Andrew Morton
2021-05-05  1:33 ` [patch 010/143] mm/hugetlb: remove redundant reservation check condition in alloc_huge_page() Andrew Morton
2021-05-05  1:33 ` [patch 011/143] mm: generalize HUGETLB_PAGE_SIZE_VARIABLE Andrew Morton
2021-05-05  1:33 ` [patch 012/143] mm/hugetlb: use some helper functions to cleanup code Andrew Morton
2021-05-05  1:33 ` [patch 013/143] mm/hugetlb: optimize the surplus state transfer code in move_hugetlb_state() Andrew Morton
2021-05-05  1:33 ` [patch 014/143] mm/hugetlb_cgroup: remove unnecessary VM_BUG_ON_PAGE in hugetlb_cgroup_migrate() Andrew Morton
2021-05-05  1:33 ` [patch 015/143] mm/hugetlb: simplify the code when alloc_huge_page() failed in hugetlb_no_page() Andrew Morton
2021-05-05  1:33 ` [patch 016/143] mm/hugetlb: avoid calculating fault_mutex_hash in truncate_op case Andrew Morton
2021-05-05  1:33 ` [patch 017/143] khugepaged: remove unneeded return value of khugepaged_collapse_pte_mapped_thps() Andrew Morton
2021-05-05  1:33 ` [patch 018/143] khugepaged: reuse the smp_wmb() inside __SetPageUptodate() Andrew Morton
2021-05-05  1:33 ` [patch 019/143] khugepaged: use helper khugepaged_test_exit() in __khugepaged_enter() Andrew Morton
2021-05-05  1:33 ` [patch 020/143] khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate() Andrew Morton
2021-05-05  1:33 ` [patch 021/143] mm/huge_memory.c: remove unnecessary local variable ret2 Andrew Morton
2021-05-05  1:33 ` [patch 022/143] mm/huge_memory.c: rework the function vma_adjust_trans_huge() Andrew Morton
2021-05-05  1:33 ` [patch 023/143] mm/huge_memory.c: make get_huge_zero_page() return bool Andrew Morton
2021-05-05  1:33 ` [patch 024/143] mm/huge_memory.c: rework the function do_huge_pmd_numa_page() slightly Andrew Morton
2021-05-05  1:34 ` [patch 025/143] mm/huge_memory.c: remove redundant PageCompound() check Andrew Morton
2021-05-05  1:34 ` [patch 026/143] mm/huge_memory.c: remove unused macro TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG Andrew Morton
2021-05-05  1:34 ` [patch 027/143] mm/huge_memory.c: use helper function migration_entry_to_page() Andrew Morton
2021-05-05  1:34 ` [patch 028/143] mm/khugepaged.c: replace barrier() with READ_ONCE() for a selective variable Andrew Morton
2021-05-05  1:34 ` [patch 029/143] khugepaged: use helper function range_in_vma() in collapse_pte_mapped_thp() Andrew Morton
2021-05-05  1:34 ` [patch 030/143] khugepaged: remove unnecessary out label in collapse_huge_page() Andrew Morton
2021-05-05  1:34 ` [patch 031/143] khugepaged: remove meaningless !pte_present() check in khugepaged_scan_pmd() Andrew Morton
2021-05-05  1:34 ` [patch 032/143] mm: huge_memory: a new debugfs interface for splitting THP tests Andrew Morton
2021-05-05  1:34 ` [patch 033/143] mm: huge_memory: debugfs for file-backed THP split Andrew Morton
2021-05-05  1:34 ` [patch 034/143] mm/hugeltb: remove redundant VM_BUG_ON() in region_add() Andrew Morton
2021-05-05  1:34 ` [patch 035/143] mm/hugeltb: simplify the return code of __vma_reservation_common() Andrew Morton
2021-05-05  1:34 ` [patch 036/143] mm/hugeltb: clarify (chg - freed) won't go negative in hugetlb_unreserve_pages() Andrew Morton
2021-05-05  1:34 ` [patch 037/143] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts() Andrew Morton
2021-05-05  1:34 ` [patch 038/143] mm/hugetlb: remove unused variable pseudo_vma in remove_inode_hugepages() Andrew Morton
2021-05-05  1:34 ` [patch 039/143] mm/cma: change cma mutex to irq safe spinlock Andrew Morton
2021-05-05  1:34 ` [patch 040/143] hugetlb: no need to drop hugetlb_lock to call cma_release Andrew Morton
2021-05-05  1:34 ` [patch 041/143] hugetlb: add per-hstate mutex to synchronize user adjustments Andrew Morton
2021-05-05  1:34 ` [patch 042/143] hugetlb: create remove_hugetlb_page() to separate functionality Andrew Morton
2021-05-05  1:34 ` [patch 043/143] hugetlb: call update_and_free_page without hugetlb_lock Andrew Morton
2021-05-05  1:35 ` [patch 044/143] hugetlb: change free_pool_huge_page to remove_pool_huge_page Andrew Morton
2021-05-05  1:35 ` [patch 045/143] hugetlb: make free_huge_page irq safe Andrew Morton
2021-05-05  1:35 ` [patch 046/143] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Andrew Morton
2021-05-05  1:35 ` [patch 047/143] mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range Andrew Morton
2021-05-05  1:35 ` [patch 048/143] mm,compaction: let isolate_migratepages_{range,block} return error codes Andrew Morton
2021-05-05  1:35 ` [patch 049/143] mm,hugetlb: drop clearing of flag from prep_new_huge_page Andrew Morton
2021-05-05  1:35 ` [patch 050/143] mm,hugetlb: split prep_new_huge_page functionality Andrew Morton
2021-05-05  1:35 ` [patch 051/143] mm: make alloc_contig_range handle free hugetlb pages Andrew Morton
2021-05-05  1:35 ` [patch 052/143] mm: make alloc_contig_range handle in-use " Andrew Morton
2021-05-05  1:35 ` [patch 053/143] mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig Andrew Morton
2021-05-05  1:35 ` [patch 054/143] userfaultfd: add minor fault registration mode Andrew Morton
2021-05-05  1:35 ` [patch 055/143] userfaultfd: disable huge PMD sharing for MINOR registered VMAs Andrew Morton
2021-05-05  1:35 ` [patch 056/143] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled Andrew Morton
2021-05-05  1:35 ` [patch 057/143] userfaultfd: add UFFDIO_CONTINUE ioctl Andrew Morton
2021-05-05  1:35 ` [patch 058/143] userfaultfd: update documentation to describe minor fault handling Andrew Morton
2021-05-05  1:35 ` [patch 059/143] userfaultfd/selftests: add test exercising " Andrew Morton
2021-05-05  1:36 ` [patch 060/143] mm/vmscan: move RECLAIM* bits to uapi header Andrew Morton
2021-05-05  1:36 ` [patch 061/143] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks Andrew Morton
2021-05-05  1:36 ` [patch 062/143] mm: vmscan: use nid from shrink_control for tracepoint Andrew Morton
2021-05-05  1:36 ` [patch 063/143] mm: vmscan: consolidate shrinker_maps handling code Andrew Morton
2021-05-05  1:36 ` [patch 064/143] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation Andrew Morton
2021-05-05  1:36 ` [patch 065/143] mm: vmscan: remove memcg_shrinker_map_size Andrew Morton
2021-05-05  1:36 ` [patch 066/143] mm: vmscan: use kvfree_rcu instead of call_rcu Andrew Morton
2021-05-05  1:36 ` [patch 067/143] mm: memcontrol: rename shrinker_map to shrinker_info Andrew Morton
2021-05-05  1:36 ` [patch 068/143] mm: vmscan: add shrinker_info_protected() helper Andrew Morton
2021-05-05  1:36 ` [patch 069/143] mm: vmscan: use a new flag to indicate shrinker is registered Andrew Morton
2021-05-05  1:36 ` [patch 070/143] mm: vmscan: add per memcg shrinker nr_deferred Andrew Morton
2021-05-05  1:36 ` [patch 071/143] mm: vmscan: use per memcg nr_deferred of shrinker Andrew Morton
2021-05-05  1:36 ` [patch 072/143] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Andrew Morton
2021-05-05  1:36 ` [patch 073/143] mm: memcontrol: reparent nr_deferred when memcg offline Andrew Morton
2021-05-05  1:36 ` [patch 074/143] mm: vmscan: shrink deferred objects proportional to priority Andrew Morton
2021-05-05  1:36 ` [patch 075/143] mm/compaction: remove unused variable sysctl_compact_memory Andrew Morton
2021-05-05  1:36 ` [patch 076/143] mm: compaction: update the COMPACT[STALL|FAIL] events properly Andrew Morton
2021-05-05  1:36 ` [patch 077/143] mm: disable LRU pagevec during the migration temporarily Andrew Morton
2021-05-05  1:36 ` [patch 078/143] mm: replace migrate_[prep|finish] with lru_cache_[disable|enable] Andrew Morton
2021-05-05  1:37 ` [patch 079/143] mm: fs: invalidate BH LRU during page migration Andrew Morton
2021-05-05  1:37 ` [patch 080/143] mm/migrate.c: make putback_movable_page() static Andrew Morton
2021-05-05  1:37 ` [patch 081/143] mm/migrate.c: remove unnecessary rc != MIGRATEPAGE_SUCCESS check in 'else' case Andrew Morton
2021-05-05  1:37 ` [patch 082/143] mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page() Andrew Morton
2021-05-05  1:37 ` [patch 083/143] mm/migrate.c: use helper migrate_vma_collect_skip() in migrate_vma_collect_hole() Andrew Morton
2021-05-05  1:37 ` [patch 084/143] Revert "mm: migrate: skip shared exec THP for NUMA balancing" Andrew Morton
2021-05-05  1:37 ` [patch 085/143] mm: vmstat: add cma statistics Andrew Morton
2021-05-05  1:37 ` [patch 086/143] mm: cma: use pr_err_ratelimited for CMA warning Andrew Morton
2021-05-05  1:37 ` [patch 087/143] mm: cma: add trace events for CMA alloc perf testing Andrew Morton
2021-05-05  1:37 ` [patch 088/143] mm: cma: support sysfs Andrew Morton
2021-05-05  1:37 ` [patch 089/143] mm: cma: add the CMA instance name to cma trace events Andrew Morton
2021-05-05  1:37 ` [patch 090/143] mm: use proper type for cma_[alloc|release] Andrew Morton
2021-05-05  1:37 ` [patch 091/143] ksm: remove redundant VM_BUG_ON_PAGE() on stable_tree_search() Andrew Morton
2021-05-05  1:37 ` [patch 092/143] ksm: use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree() Andrew Morton
2021-05-05  1:37 ` [patch 093/143] ksm: remove dedicated macro KSM_FLAG_MASK Andrew Morton
2021-05-05  1:37 ` [patch 094/143] ksm: fix potential missing rmap_item for stable_node Andrew Morton
2021-05-05  1:37 ` [patch 095/143] mm/ksm: remove unused parameter from remove_trailing_rmap_items() Andrew Morton
2021-05-05  1:37 ` [patch 096/143] mm: restore node stat checking in /proc/sys/vm/stat_refresh Andrew Morton
2021-05-05  1:37 ` [patch 097/143] mm: no more EINVAL from /proc/sys/vm/stat_refresh Andrew Morton
2021-05-05  1:37 ` [patch 098/143] mm: /proc/sys/vm/stat_refresh skip checking known negative stats Andrew Morton
2021-05-05  1:38 ` [patch 099/143] mm: /proc/sys/vm/stat_refresh stop checking monotonic numa stats Andrew Morton
2021-05-05  1:38 ` [patch 100/143] x86/mm: track linear mapping split events Andrew Morton
2021-05-05  1:38 ` [patch 101/143] mm/mmap.c: don't unlock VMAs in remap_file_pages() Andrew Morton
2021-05-05  1:38 ` [patch 102/143] mm: generalize ARCH_HAS_CACHE_LINE_SIZE Andrew Morton
2021-05-05  1:38 ` [patch 104/143] mm: generalize ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE] Andrew Morton
2021-05-05  1:38 ` [patch 105/143] mm: drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION Andrew Morton
2021-05-05  1:38 ` [patch 108/143] mm/util.c: reduce mem_dump_obj() object size Andrew Morton
2021-05-05  1:38 ` [patch 109/143] mm/util.c: fix typo Andrew Morton
2021-05-05  1:38 ` [patch 110/143] mm/gup: don't pin migrated cma pages in movable zone Andrew Morton
2021-05-05  1:38 ` [patch 111/143] mm/gup: check every subpage of a compound page during isolation Andrew Morton
2021-05-05  1:38 ` [patch 112/143] mm/gup: return an error on migration failure Andrew Morton
2021-05-05  1:38 ` [patch 113/143] mm/gup: check for isolation errors Andrew Morton
2021-05-05  1:38 ` [patch 114/143] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN Andrew Morton
2021-05-05  1:38 ` [patch 115/143] mm: apply per-task gfp constraints in fast path Andrew Morton
2021-05-05  1:39 ` [patch 116/143] mm: honor PF_MEMALLOC_PIN for all movable pages Andrew Morton
2021-05-05  1:39 ` [patch 117/143] mm/gup: do not migrate zero page Andrew Morton
2021-05-05  1:39 ` [patch 118/143] mm/gup: migrate pinned pages out of movable zone Andrew Morton
2021-05-05  1:39 ` [patch 119/143] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning Andrew Morton
2021-05-05  1:39 ` [patch 120/143] mm/gup: change index type to long as it counts pages Andrew Morton
2021-05-05  1:39 ` [patch 121/143] mm/gup: longterm pin migration cleanup Andrew Morton
2021-05-05  1:39 ` [patch 122/143] selftests/vm: gup_test: fix test flag Andrew Morton
2021-05-05  1:39 ` [patch 123/143] selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages Andrew Morton
2021-05-05  1:39 ` [patch 124/143] mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove Andrew Morton
2021-05-05  1:39 ` [patch 125/143] drivers/base/memory: introduce memory_block_{online,offline} Andrew Morton
2021-05-05  1:39 ` [patch 126/143] mm,memory_hotplug: relax fully spanned sections check Andrew Morton
2021-05-05  1:39 ` [patch 127/143] mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count() Andrew Morton
2021-05-05  1:39 ` Andrew Morton [this message]
2021-05-05  1:39 ` [patch 129/143] acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported Andrew Morton
2021-05-05  1:39 ` [patch 130/143] mm,memory_hotplug: add kernel boot option to enable memmap_on_memory Andrew Morton
2021-05-05  1:39 ` [patch 131/143] x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Andrew Morton
2021-05-05  1:39 ` [patch 132/143] arm64/Kconfig: " Andrew Morton
2021-05-05  1:39 ` [patch 133/143] mm/zswap.c: switch from strlcpy to strscpy Andrew Morton
2021-05-05  1:40 ` [patch 134/143] mm/zsmalloc: use BUG_ON instead of if condition followed by BUG Andrew Morton
2021-05-05  1:40 ` [patch 135/143] iov_iter: lift memzero_page() to highmem.h Andrew Morton
2021-05-05  1:40 ` [patch 136/143] btrfs: use memzero_page() instead of open coded kmap pattern Andrew Morton
2021-05-05  1:40 ` [patch 137/143] mm/highmem.c: fix coding style issue Andrew Morton
2021-05-05  1:40 ` [patch 138/143] mm/mempool: minor coding style tweaks Andrew Morton
2021-05-05  1:40 ` [patch 139/143] mm/process_vm_access.c: remove duplicate include Andrew Morton
2021-05-05  1:40 ` [patch 140/143] kfence: zero guard page after out-of-bounds access Andrew Morton
2021-05-05  1:40 ` [patch 141/143] kfence: await for allocation using wait_event Andrew Morton
2021-05-05  1:40 ` [patch 142/143] kfence: maximize allocation wait timeout duration Andrew Morton
2021-05-05  1:40 ` [patch 143/143] kfence: use power-efficient work queue to run delayed work Andrew Morton
2021-05-05  1:47 ` incoming Linus Torvalds
2021-05-05  3:16   ` incoming Andrew Morton
2021-05-05 17:10     ` incoming Linus Torvalds
2021-05-05 17:44       ` incoming Andrew Morton
2021-05-06  3:19         ` incoming Anshuman Khandual
