Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Fri, 16 Apr 2021 12:33:34 +0200
Message-ID: <df8220ac-4214-5ff6-0048-35553fea8c8c@redhat.com>
In-Reply-To: <20210416102153.8794-5-osalvador@suse.de>

On 16.04.21 12:21, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>   a) an existing memory is consumed for that purpose
>      (eg: ~2MB per 128MB memory section on x86_64)
>   b) if the whole node is movable then we have off-node struct pages
>      which has performance drawbacks.
>   c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>      populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This patch also introduces the following functions:
> 
>   - mhp_init_memmap_on_memory:
> 		       Initializes vmemmap pages by calling move_pfn_range_to_zone(),
> 		       calls kasan_add_zero_shadow(), and onlines as many sections
> 		       as vmemmap pages fully span.
>   - mhp_deinit_memmap_on_memory:
> 		       Undoes what mhp_init_memmap_on_memory() does.
> 
> The new function memory_block_online() calls mhp_init_memmap_on_memory() before
> doing the actual online_pages(). Should online_pages() fail, we clean up
> by calling mhp_deinit_memmap_on_memory().
> Adjusting of present_pages is done at the end once we know that online_pages()
> succeeded.
> 
> On offline, memory_block_offline() needs to unaccount vmemmap pages from
> present_pages before calling offline_pages().
> This is necessary because offline_pages() tears down some structures
> depending on whether the node or the zone becomes empty.
> If offline_pages() fails, we account back vmemmap pages.
> If it succeeds, we call mhp_deinit_memmap_on_memory().
> 
> Hot-remove:
> 
>   We need to be careful when removing memory, as adding and
>   removing memory needs to be done with the same granularity.
>   To check that this assumption is not violated, we check the
>   memory range we want to remove and if a) any memory block has
>   vmemmap pages and b) the range spans more than a single memory
>   block, we scream out loud and refuse to proceed.
> 
>   If all is good and the range was using memmap on memory (aka vmemmap pages),
>   we construct an altmap structure so free_hugepage_table does the right
>   thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   drivers/base/memory.c          |  75 ++++++++++++++++--
>   include/linux/memory.h         |   8 +-
>   include/linux/memory_hotplug.h |  17 +++-
>   include/linux/memremap.h       |   2 +-
>   include/linux/mmzone.h         |   7 +-
>   mm/Kconfig                     |   5 ++
>   mm/memory_hotplug.c            | 171 ++++++++++++++++++++++++++++++++++++++---
>   mm/sparse.c                    |   2 -
>   8 files changed, 265 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..179857d53982 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem)
>   {
>   	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>   	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid,
> +				   mem->online_type);
> +
> +	/*
> +	 * Although vmemmap pages have a different lifecycle than the pages
> +	 * they describe (they remain until the memory is unplugged), doing
> +	 * its initialization and accounting at hot-{online,offline} stage

s/its/their/

s,hot-{online,offline} stage,memory onlining/offlining stage,

> +	 * simplifies things a lot
> +	 */
> +	if (nr_vmemmap_pages) {
> +		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = online_pages(start_pfn + nr_vmemmap_pages,
> +			   nr_pages - nr_vmemmap_pages, zone);
> +	if (ret) {
> +		if (nr_vmemmap_pages)
> +			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Account once onlining succeeded. If the page was unpopulated, it is

s/page/zone/

> +	 * now already properly populated.
> +	 */
> +	if (nr_vmemmap_pages)
> +		adjust_present_page_count(zone, nr_vmemmap_pages);
>   
> -	return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> +	return ret;
>   }
>   
>   static int memory_block_offline(struct memory_block *mem)
>   {
>   	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>   	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
>   
> -	return offline_pages(start_pfn, nr_pages);
> +	/*
> +	 * Unaccount before offlining, such that unpopulated zone and kthreads
> +	 * can properly be torn down in offline_pages().
> +	 */
> +	if (nr_vmemmap_pages)
> +		adjust_present_page_count(zone, -nr_vmemmap_pages);
> +
> +	ret = offline_pages(start_pfn + nr_vmemmap_pages,
> +			    nr_pages - nr_vmemmap_pages);
> +	if (ret) {
> +		/* offline_pages() failed. Account back. */
> +		if (nr_vmemmap_pages)
> +			adjust_present_page_count(zone, nr_vmemmap_pages);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Re-adjust present pages if offline_pages() fails.
> +	 */

That comment is stale. I'd just drop it.

> +	if (nr_vmemmap_pages)
> +		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +
> +	return ret;
>   }
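
To make the accounting order in memory_block_offline() easier to follow, here is a rough user-space model of it. Everything in it is a hypothetical stand-in (struct zone_model, offline_pages_model(), the -16 error value); it is not kernel code and only mirrors the unaccount / offline / account-back sequence from the hunk above.

```c
#include <assert.h>

/* Hypothetical stand-in for struct zone; only tracks present_pages. */
struct zone_model {
	long present_pages;
};

/*
 * Hypothetical stand-in for offline_pages(): fails on request, otherwise
 * removes the offlined pages from the zone's present_pages.
 */
static int offline_pages_model(struct zone_model *zone, long nr_pages, int fail)
{
	if (fail)
		return -16;	/* EBUSY-like error, chosen for illustration */
	assert(zone->present_pages >= nr_pages);
	zone->present_pages -= nr_pages;
	return 0;
}

static int memory_block_offline_model(struct zone_model *zone, long nr_pages,
				      long nr_vmemmap_pages, int fail)
{
	int ret;

	/* Unaccount vmemmap pages first, so an emptied zone is seen as empty. */
	if (nr_vmemmap_pages)
		zone->present_pages -= nr_vmemmap_pages;

	ret = offline_pages_model(zone, nr_pages - nr_vmemmap_pages, fail);
	if (ret) {
		/* Offlining failed: account the vmemmap pages back. */
		if (nr_vmemmap_pages)
			zone->present_pages += nr_vmemmap_pages;
		return ret;
	}

	/* The real code would call mhp_deinit_memmap_on_memory() here. */
	return 0;
}
```

In this model a failed offline leaves present_pages unchanged, and a successful one of a block that was the zone's only memory brings it to zero.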

[...]

> -static void adjust_present_page_count(struct zone *zone, long nr_pages)
> +/*
> + * This function should only be called by memory_block_{online,offline},
> + * and {online,offline}_pages.
> + */
> +void adjust_present_page_count(struct zone *zone, long nr_pages)
>   {
>   	unsigned long flags;
>   
> @@ -839,12 +850,64 @@ static void adjust_present_page_count(struct zone *zone, long nr_pages)
>   	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>   }
>   
> -int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       int online_type, int nid)
> +struct zone *mhp_get_target_zone(unsigned long pfn, unsigned long nr_pages,
> +				 int nid, int online_type)
> +{
> +	return zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> +}
> +

Oh, you can just use zone_for_pfn_range() directly for now. No need for 
mhp_get_target_zone(). Sorry for not realizing this.

> +int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> +			      struct zone *zone)
> +{
> +	unsigned long end_pfn = pfn + nr_pages;
> +	int ret;
> +
> +	/*
> +	 * Initialize vmemmap pages with the corresponding node, zone links set.
> +	 */
> +	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> +
> +	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
> +	if (ret) {
> +		remove_pfn_range_from_zone(zone, pfn, nr_pages);
> +		return ret;
> +	}

IIRC, we have to add the zero shadow first, before touching the memory. 
This is also what mm/memremap.c does.

In mhp_deinit_memmap_on_memory(), you already remove in the proper 
(reversed) order :)
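
The ordering suggested above can be sketched as a small user-space model. All names here (struct init_trace, add_zero_shadow_model(), move_range_to_zone_model(), the -12 error value) are hypothetical stand-ins for kasan_add_zero_shadow() and move_pfn_range_to_zone(); this is an illustration of the sequence under those assumptions, not the actual fix.

```c
#include <assert.h>

enum { STEP_NONE, STEP_SHADOW, STEP_ZONE };

/* Hypothetical trace of which init steps ran, and in what order. */
struct init_trace {
	int steps[4];
	int n;
	int shadow_added;
	int in_zone;
};

/* Stand-in for kasan_add_zero_shadow(); can be told to fail. */
static int add_zero_shadow_model(struct init_trace *t, int fail)
{
	if (fail)
		return -12;	/* ENOMEM-like error, chosen for illustration */
	t->shadow_added = 1;
	t->steps[t->n++] = STEP_SHADOW;
	return 0;
}

/* Stand-in for move_pfn_range_to_zone(), which touches the memmap. */
static void move_range_to_zone_model(struct init_trace *t)
{
	/* The memory may only be touched once the zero shadow exists. */
	assert(t->shadow_added);
	t->in_zone = 1;
	t->steps[t->n++] = STEP_ZONE;
}

static int init_memmap_on_memory_model(struct init_trace *t, int fail_shadow)
{
	int ret;

	/* Shadow first: a failure here leaves nothing to roll back. */
	ret = add_zero_shadow_model(t, fail_shadow);
	if (ret)
		return ret;

	move_range_to_zone_model(t);
	return 0;
}
```

With the shadow added first, the error path in this model needs no counterpart to remove_pfn_range_from_zone(), since the range has not been moved to the zone yet when the shadow setup fails.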

> +
> +int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
>   {
>   	unsigned long flags;
> -	struct zone *zone;
>   	int need_zonelists_rebuild = 0;
> +	int nid;
>   	int ret;
>   	struct memory_notify arg;
>   
> @@ -860,8 +923,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>   
>   	mem_hotplug_begin();
>   
> +	nid = zone_to_nid(zone);

I'd do that right above, at the declaration:

const int nid = zone_to_nid(zone);

[...]

-- 
Thanks,

David / dhildenb