From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>, pasha.tatashin@soleen.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Anshuman Khandual <anshuman.khandual@arm.com>
Subject: Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Thu, 25 Feb 2021 19:58:01 +0100
Message-ID: <60afb5ca-230e-265f-9579-dac66a152c33@redhat.com>
In-Reply-To: <20210209133854.17399-2-osalvador@suse.de>

On 09.02.21 14:38, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>   a) existing memory is consumed for that purpose
>      (e.g. ~2MB per 128MB memory section on x86_64)
>   b) if the whole node is movable then we have off-node struct pages
>      which have performance drawbacks.
>   c) it might be that there are no PMD_ALIGNED chunks, so the memmap
>      array gets populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation-wise we will reuse the vmem_altmap infrastructure to
> override the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on the memory_block structure
> gaining a new field which specifies the number of vmemmap_pages at the
> beginning. This comes in handy as in {online,offline}_pages, all the
> isolation and migration is done on the (buddy_start_pfn, end_pfn]
> range, where buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

nit: shouldn't it be

[start_pfn, buddy_start_pfn - 1]
[buddy_start_pfn, end_pfn - 1]

or

[start_pfn, buddy_start_pfn)
[buddy_start_pfn, end_pfn)

(I remember that "[" means inclusive and "(" means exclusive, I might be wrong :) )

I actually prefer the first variant.

> 
> Hot-remove:
> 
>   We need to be careful when removing memory, as adding and
>   removing memory needs to be done with the same granularity.
>   To check that this assumption is not violated, we check the
>   memory range we want to remove and if a) any memory block has
>   vmemmap pages and b) the range spans more than a single memory
>   block, we scream out loud and refuse to proceed.
> 
>   If all is good and the range was using memmap on memory (aka vmemmap pages),
>   we construct an altmap structure so free_hugepage_table does the right
>   thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

[...]

>   /*
>    * online_page_callback contains pointer to current page onlining function.
>    * Initially it is generic_online_page(). If it is required it could be
> @@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
>   }
>   EXPORT_SYMBOL_GPL(generic_online_page);
>   
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long buddy_start_pfn)
>   {
>   	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn;
> +	unsigned long pfn = buddy_start_pfn;
> +
> +	/*
> +	 * When using memmap_on_memory, the range might be unaligned as the
> +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> +	 */
> +	if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {

if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))

> +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> +		pfn += 1 << pageblock_order;

pfn += pageblock_nr_pages;

Can you add a comment explaining why we can be sure that we are off by a single pageblock? What about s390x, where MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?

Would it make things simpler to just do a

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
	(*online_page_callback)(pfn_to_page(pfn), 0);
	pfn++;
}
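
Completely untested sketch of how I'd imagine the helper looking with that loop (only an illustration, not your patch): online the unaligned head page by page, then keep the existing MAX_ORDER - 1 chunk loop and the online_mem_sections() call for the rest, so we never rely on the offset being exactly one pageblock:

static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
			       unsigned long buddy_start_pfn)
{
	const unsigned long end_pfn = start_pfn + nr_pages;
	unsigned long pfn = buddy_start_pfn;

	/*
	 * With memmap_on_memory, the first pfns back the vmemmap and are
	 * not exposed to the buddy, so buddy_start_pfn might not be
	 * MAX_ORDER - 1 aligned. Online the unaligned head page by page.
	 */
	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
		(*online_page_callback)(pfn_to_page(pfn), 0);
		pfn++;
	}

	/* The remainder is aligned; online it in MAX_ORDER - 1 chunks. */
	for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);

	/* mark all involved sections as online */
	online_mem_sections(start_pfn, end_pfn);
}

That would also cover s390x, where the unaligned head can span several pageblocks.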

[...]


>   
>   	/*
>   	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>   	return device_online(&mem->dev);
>   }
>   
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	return memmap_on_memory_enabled &&
> +	       size == memory_block_size_bytes();

Regarding my other comments in reply to the other patches, I'd move all the magic you apply when trying to enable this right here.
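
Something along these lines is what I have in mind - only a sketch, with the conditions taken from the later patches in this series (vmemmap spanning complete PMDs from patch 4, pageblock alignment of the remaining range from patch 5); memmap_on_memory_enabled is the toggle from this patch:

bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_pages = size >> PAGE_SHIFT;
	unsigned long vmemmap_size = nr_pages * sizeof(struct page);
	unsigned long remaining_size = size - vmemmap_size;

	/*
	 * Sketch: besides the feature being enabled, only allow a single
	 * memory block whose vmemmap spans complete PMDs and whose
	 * remaining part is pageblock aligned.
	 */
	return memmap_on_memory_enabled &&
	       size == memory_block_size_bytes() &&
	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
}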


>   
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			unsigned long nr_vmemmap_pages)
>   {
>   	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn, system_ram_pages = 0;
> +	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
>   	unsigned long flags;
>   	struct zone *zone;
>   	struct memory_notify arg;
> @@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
>   		return -EINVAL;
>   
> +	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> +	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
>   	mem_hotplug_begin();
>   
>   	/*
> @@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	zone_pcp_disable(zone);
>   
>   	/* set above range as isolated */
> -	ret = start_isolate_page_range(start_pfn, end_pfn,
> +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
>   				       MIGRATE_MOVABLE,
>   				       MEMORY_OFFLINE | REPORT_FAILURE);

Did you take care to properly adjust undo_isolate_page_range() as well? I can't spot it.
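
Just to spell out what I mean - a sketch, assuming the error labels in offline_pages() stay as they are today: the failure path has to undo exactly the range that was isolated above, i.e.

failed_removal_isolated:
	/* Only [buddy_start_pfn, end_pfn) was isolated above. */
	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);
	memory_notify(MEM_CANCEL_OFFLINE, &arg);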


>   	if (ret) {
> @@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	}
>   
>   	do {
> -		pfn = start_pfn;
> +		pfn = buddy_start_pfn;
>   		do {
>   			if (signal_pending(current)) {
>   				ret = -EINTR;
> @@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   		 * offlining actually in order to make hugetlbfs's object
>   		 * counting consistent.
>   		 */
> -		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
> +		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
>   		if (ret) {
>   			reason = "failure to dissolve huge pages";
>   			goto failed_removal_isolated;
>   		}
>   
> -		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> +		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
>   
>   	} while (ret);
>   
>   	/* Mark all sections offline and remove free pages from the buddy. */
> -	__offline_isolated_pages(start_pfn, end_pfn);
> +	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
>   	pr_debug("Offlined Pages %ld\n", nr_pages);
>   
>   	/*
> @@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	 * of isolated pageblocks, memory onlining will properly revert this.
>   	 */
>   	spin_lock_irqsave(&zone->lock, flags);
> -	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
> +	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
>   	spin_unlock_irqrestore(&zone->lock, flags);
>   
>   	zone_pcp_enable(zone);
>   
>   	/* removal success */
> -	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> +	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
>   	zone->present_pages -= nr_pages;
>   
>   	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>   	return 0;
>   }
>   

[...]

> +static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> +{
> +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +
> +	*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> +	return mem->nr_vmemmap_pages;
> +}
> +

I think you can do this more easily: all you want to know is whether there
is any block that has nr_vmemmap_pages set - and return the value.

static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
{
	/* If not set, continue with the next block. */
	return mem->nr_vmemmap_pages;
}

...
> +		walk_memory_blocks(start, size, &nr_vmemmap_pages,
> +					 get_nr_vmemmap_pages_cb);

...

nr_vmemmap_pages = walk_memory_blocks(start ...)
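
Expanded, the removal side could then look roughly like this - again only a sketch of what I'd imagine based on your commit message (in try_remove_memory(), before arch_remove_memory()); the altmap is only there so free_hugepage_table() ends up in vmem_altmap_free():

	unsigned long nr_vmemmap_pages;
	struct vmem_altmap mhp_altmap = {};
	struct vmem_altmap *altmap = NULL;

	nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
					      first_set_nr_vmemmap_pages_cb);
	if (nr_vmemmap_pages) {
		/*
		 * The range was added with an altmap, so the vmemmap pages
		 * must be freed via vmem_altmap_free() instead of the page
		 * allocator.
		 */
		mhp_altmap.alloc = nr_vmemmap_pages;
		altmap = &mhp_altmap;
	}

	arch_remove_memory(nid, start, size, altmap);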



Looks quite promising, only a couple of things to fine-tune :) Thanks!

-- 
Thanks,

David / dhildenb



