All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: mhocko@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, vbabka@suse.cz, pasha.tatashin@soleen.com
Subject: Re: [RFC PATCH 3/3] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Tue, 17 Nov 2020 16:38:52 +0100	[thread overview]
Message-ID: <3cc37927-538e-ae7d-27bc-45aaabe06b3a@redhat.com> (raw)
In-Reply-To: <20201022125835.26396-4-osalvador@suse.de>

On 22.10.20 14:58, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>   a) an existing memory is consumed for that purpose
>      (eg: ~2MB per 128MB memory section on x86_64)
>   b) if the whole node is movable then we have off-node struct pages
>      which has performance drawbacks.
>   c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>      populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap. Once the memmap is
> allocated, we are going to need a way to mark altmap pfns used for the allocation.
> If MHP_MEMMAP_ON_MEMORY flag was passed, we will set up the layout of the
> altmap structure in add_memory_resouce(), and then we will call
> mhp_mark_vmemmap_pages() to properly mark those pages.
> 
> Online/Offline:
> 
>   In the memory_block structure, a new field is created in order to
>   store the number of vmemmap_pages.
>   Having that around simplifies things a lot since in {online/offline}_pages
>   we can know how much we have to skip forward until we have the first non-
>   vmemmap page, for operations like isolation/migration/initialization.
> 
> Hot-remove:
> 
>   If the range was using memmap on memory (aka vmemmap pages),
>   we construct an altmap structure so free_hugepage_table does
>   the right thing and calls vmem_altmap_free instead of
>   free_pagetable.
> 

Sorry for the late replay, fairly busy with all kinds of things.

[...]

> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index b02fd51e5589..6b57bf90ca72 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -195,7 +195,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>   			node = memory_add_physaddr_to_nid(info->start_addr);
>   
>   		result = __add_memory(node, info->start_addr, info->length,
> -				      MHP_NONE);
> +				      MEMHP_MEMMAP_ON_MEMORY);

I'd suggest moving that into a separate patch.

[...]

> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 439a89e758d8..7cc93de5856c 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -30,6 +30,7 @@ struct memory_block {
>   	int phys_device;		/* to which fru does this belong? */
>   	struct device dev;
>   	int nid;			/* NID for this memory block */
> +	unsigned long nr_vmemmap_pages;	/* Number for vmemmap pages */

Maybe also document that these pages are directly at the beginning of 
the memory block.


>   
> -static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +				unsigned long nr_vmemmap_pages)
>   {
>   	return -EINVAL;
>   }
> @@ -369,10 +372,12 @@ extern int add_memory_driver_managed(int nid, u64 start, u64 size,
>   				     mhp_t mhp_flags);
>   extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>   				   unsigned long nr_pages,
> +				   unsigned long nr_vmemmap_pages,
>   				   struct vmem_altmap *altmap, int migratetype);
>   extern void remove_pfn_range_from_zone(struct zone *zone,
>   				       unsigned long start_pfn,
> -				       unsigned long nr_pages);
> +				       unsigned long nr_pages,
> +				       unsigned long nr_vmemmap_pages);

I think we should not pass nr_vmemmap_pages down here but instead do two 
separate calls to move_pfn_range_to_zone()/remove_pfn_range_from_zone() 
from online_pages()/offline_pages()

1. for vmemmap pages, migratetype = MIGRATE_UNMOVABLE
2. for remaining pages, migratetype = MIGRATE_ISOLATE

[...]

>   
>   		/* Select all remaining pages up to the next section boundary */
> @@ -629,6 +630,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
>   {
>   	const unsigned long end_pfn = start_pfn + nr_pages;
>   	unsigned long pfn;
> +	unsigned int order;
>   
>   	/*
>   	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
> @@ -636,8 +638,12 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
>   	 * later). We account all pages as being online and belonging to this
>   	 * zone ("present").
>   	 */
> -	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> -		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += (1 << order)) {
> +		order = MAX_ORDER - 1;
> +		while (pfn & ((1 << order) - 1))
> +			order--;
> +		(*online_page_callback)(pfn_to_page(pfn), order);
> +	}
>   
>   	/* mark all involved sections as online */
>   	online_mem_sections(start_pfn, end_pfn);
> @@ -697,6 +703,18 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
>   	pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
>   
>   }
> +
> +static void link_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
> +			       struct zone *zone, int nid)
> +{
> +	struct page *p = pfn_to_page(pfn);
> +	enum zone_type z = zone_idx(zone);
> +	unsigned long i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		set_page_links(p + i, z, nid, pfn);
> +}
> +
>   /*
>    * Associate the pfn range with the given zone, initializing the memmaps
>    * and resizing the pgdat/zone data to span the added pages. After this
> @@ -708,6 +726,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
>    */
>   void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>   				  unsigned long nr_pages,
> +				  unsigned long nr_vmemmap_pages,
>   				  struct vmem_altmap *altmap, int migratetype)
>   {
>   	struct pglist_data *pgdat = zone->zone_pgdat;
> @@ -726,14 +745,22 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>   	resize_pgdat_range(pgdat, start_pfn, nr_pages);
>   	pgdat_resize_unlock(pgdat, &flags);
>   
> +	if (nr_vmemmap_pages)
> +		/*
> +		 * Vmemmap pages are already initialized at hot-add state.
> +		 * What is left is to link them to a node and a zone.
> +		 */
> +		link_vmemmap_pages(start_pfn, nr_vmemmap_pages, zone, nid);
> +
>   	/*
>   	 * TODO now we have a visible range of pages which are not associated
>   	 * with their zone properly. Not nice but set_pfnblock_flags_mask
>   	 * expects the zone spans the pfn range. All the pages in the range
>   	 * are reserved so nobody should be touching them so we should be safe
>   	 */
> -	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
> -			 MEMINIT_HOTPLUG, altmap, migratetype);
> +	memmap_init_zone(nr_pages - nr_vmemmap_pages, nid, zone_idx(zone),
> +			 start_pfn + nr_vmemmap_pages, MEMINIT_HOTPLUG, altmap,
> +			 migratetype);
>   
>   	set_zone_contiguous(zone);
>   }
> @@ -796,9 +823,9 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
>   }
>   
>   int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       int online_type, int nid)
> +		       unsigned long nr_vmemmap_pages, int online_type, int nid)
>   {
> -	unsigned long flags;
> +	unsigned long flags, valid_start_pfn, valid_nr_pages;
>   	struct zone *zone;
>   	int need_zonelists_rebuild = 0;
>   	int ret;
> @@ -809,11 +836,15 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>   			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
>   		return -EINVAL;
>   
> +	valid_start_pfn = pfn + nr_vmemmap_pages;
> +	valid_nr_pages = nr_pages - nr_vmemmap_pages;

Hm, valid sounds strange. More like "free_start_pfn" or "buddy_start_pfn".

> +
>   	mem_hotplug_begin();
>   
>   	/* associate pfn range with the zone */
>   	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> -	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
> +	move_pfn_range_to_zone(zone, pfn, nr_pages, nr_vmemmap_pages, NULL,
> +			       MIGRATE_ISOLATE);

As mentioned, I'd suggest properly initializing the memmap here

if (nr_vmemmap_pages) {
	move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
			       MIGRATE_UNMOVABLE);
}
move_pfn_range_to_zone(zone, valid_start_pfn, valid_nr_pages, NULL,
		       MIGRATE_ISOLATE);

[...]

>   /*
>    * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>    * and online/offline operations (triggered e.g. by sysfs).
> @@ -1039,6 +1080,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>   int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   {
>   	struct mhp_params params = { .pgprot = PAGE_KERNEL };
> +	struct vmem_altmap mhp_altmap = {};
>   	u64 start, size;
>   	bool new_node = false;
>   	int ret;
> @@ -1065,18 +1107,36 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   		goto error;
>   	new_node = ret;
>   
> +	/*
> +	 * Check whether we support memmap_on_memory
> +	 */
> +	if (!support_memmap_on_memory(size))
> +		mhp_flags &= ~MEMHP_MEMMAP_ON_MEMORY;

Callers (e.g., virtio-mem) might rely on this. We should reject this 
with -EINVAL and provide a way for callers to test whether this flag is 
possible.

> +
> +	/*
> +	 * Self hosted memmap array
> +	 */
> +	if (mhp_flags & MEMHP_MEMMAP_ON_MEMORY) {
> +		mhp_altmap.free = size >> PAGE_SHIFT;
> +		mhp_altmap.base_pfn = start >> PAGE_SHIFT;
> +		params.altmap = &mhp_altmap;
> +	}
> +
>   	/* call arch's memory hotadd */
>   	ret = arch_add_memory(nid, start, size, &params);
>   	if (ret < 0)
>   		goto error;
>   
>   	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size);
> +	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
>   	if (ret) {
>   		arch_remove_memory(nid, start, size, NULL);
>   		goto error;
>   	}
>   
> +	if (mhp_flags & MEMHP_MEMMAP_ON_MEMORY)
> +		mhp_mark_vmemmap_pages(params.altmap);

Do we really still need that? Pages are offline, so we're messing with 
an invalid memmap. online_pages() should handle initializing the memmap 
of these pages.

[...]

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e74ca22baaa1..043503fb8c6e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8761,6 +8761,13 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>   	spin_lock_irqsave(&zone->lock, flags);
>   	while (pfn < end_pfn) {
>   		page = pfn_to_page(pfn);
> +		/*
> +		 * Skip vmemmap pages
> +		 */
> +		if (PageVmemmap(page)) {
> +			pfn += vmemmap_nr_pages(page);
> +			continue;
> +		}

I'd assume calling code can handle that and exclude isolating such pages.


-- 
Thanks,

David / dhildenb


  reply	other threads:[~2020-11-17 15:39 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-22 12:58 [RFC PATCH 0/3] Allocate memmap from hotadded memory (per device) Oscar Salvador
2020-10-22 12:58 ` [RFC PATCH 1/3] mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY Oscar Salvador
2020-10-22 13:04   ` David Hildenbrand
2020-10-22 12:58 ` [RFC PATCH 2/3] mm: Introduce a new Vmemmap page-type Oscar Salvador
2020-11-20 11:20   ` David Hildenbrand
2020-10-22 12:58 ` [RFC PATCH 3/3] mm,memory_hotplug: Allocate memmap from the added memory range Oscar Salvador
2020-11-17 15:38   ` David Hildenbrand [this message]
2020-11-19 10:48     ` Oscar Salvador
2020-11-20  9:31       ` David Hildenbrand
2020-10-22 13:01 ` [RFC PATCH 0/3] Allocate memmap from hotadded memory (per device) David Hildenbrand
2020-10-27 15:40   ` Oscar Salvador
2020-10-27 15:44     ` David Hildenbrand
2020-10-27 15:58       ` Oscar Salvador
2020-10-28 18:47         ` Mike Kravetz
2020-10-29  7:49           ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3cc37927-538e-ae7d-27bc-45aaabe06b3a@redhat.com \
    --to=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.