From: Oscar Salvador <osalvador@suse.de>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Wed, 21 Apr 2021 10:15:46 +0200	[thread overview]
Message-ID: <20210421081546.GD22456@linux> (raw)
In-Reply-To: <YH6zQ1Dty9kJFkuk@dhcp22.suse.cz>

On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
> On Fri 16-04-21 13:24:07, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array) for
> > the newly added memory section. Currently, alloc_pages_node() is used
> > for those allocations.
> > 
> > This has some disadvantages:
> >  a) an existing memory is consumed for that purpose
> >     (eg: ~2MB per 128MB memory section on x86_64)
> 
> I would extend this slightly. This can even lead to extreme cases where
> system goes OOM because the physically hotplugged memory depletes the
> available memory before it is onlined.

Ok.

> > Vmemmap page tables can map arbitrary memory.
> > That means that we can simply use the beginning of each memory section and
> > map struct pages there.
> 
> Again this can be confusing because this is not what is really happening
> in practice because we are going to have a multisection memory block
> where all sections will be backed by a common reserved space rather than
> per section sparse space. I would go with
> 
> "
> Vmemmap page tables can map arbitrary memory. That means that we can
> reserve a part of the physically hotadded memory to back vmemmap page
> tables. This implementation uses the beginning of the hotplugged memory
> for that purpose.
> "

Yeah, I thought I had fixed that; it should have been "That means that we can simply
use the beginning of each memory block...", but I am ok with your rewording.

> There is quite a large leap from __populate_section_memmap to the
> memory_block that deserves explaining to not lose all the subtle things
> discussed in the past. I think it should be made clear why all the fuss.
> I would structure it as follows:
> "
> There are some non-obvious things to consider though. Vmemmap
> pages are allocated/freed during the memory hotplug events
> (add_memory_resource, try_remove_memory) when the memory is
> added/removed. This means that the reserved physical range is not online
> yet it is used. The most obvious side effect is that pfn_to_online_page
> returns NULL for those pfns. The current design expects that this
> should be OK as the hotplugged memory is considered garbage until it
> is onlined. For example, hibernation wouldn't save the content of those
> vmemmaps into the image, so it wouldn't be restored on resume, but this
> should be OK as there is no real content to recover anyway, while the metadata
> is reachable from other data structures (e.g. vmemmap page tables).
> 
> The reserved space is therefore (de)initialized during the {on,off}line
> events (mhp_{de}init_memmap_on_memory). That is done by extracting page
> allocator independent initialization from the regular onlining path.
> The primary reason to handle the reserved space outside of {on,off}line_pages
> is to make each initialization specific to the purpose rather than
> special case them in a single function.

Ok, that definitely adds valuable information.

> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index f209925a5d4e..2e2b2f654f0a 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
> >  {
> >  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> >  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> > +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> > +	struct zone *zone;
> > +	int ret;
> > +
> > +	zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
> > +
> > +	/*
> > +	 * Although vmemmap pages have a different lifecycle than the pages
> > +	 * they describe (they remain until the memory is unplugged), doing
> > +	 * their initialization and accounting at memory onlining/offlining
> > +	 * stage simplifies things a lot.
> 
> "simplify things a lot" is not really helpful to people reading the
> code. It would be much better to state reasons here. I would go with
> 	 * stage helps to keep accounting easier to follow - e.g.
> 	 * vmemmaps belong to the same zone as the onlined memory.

Ok

> >  static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> >  {
> >  	const unsigned long end_pfn = start_pfn + nr_pages;
> > -	unsigned long pfn;
> > +	unsigned long pfn = start_pfn;
> > +
> > +	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
> > +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> > +		pfn += pageblock_nr_pages;
> > +	}
> 
> I believe we do not need to check for nr_pages as the actual operation
> will never run out of range in practice but the code is more subtle than

If you mean that the IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES) check can go, that is not
right.
Of course, with your changes below it would not be necessary.

> necessary. Using two different iteration styles is also hurting the code
> readability. I would go with the following
> 	for (pfn = start_pfn; pfn < end_pfn; ) {
> 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
> 
> 		while (pfn + (1UL << order) > end_pfn)
> 			order--;
> 		(*online_page_callback)(pfn_to_page(pfn), order);
> 		pfn += 1UL << order;
> 	}
> 
> which is what __free_pages_memory does already.

This is roughly what I had in the early versions, but David and I agreed to split
it into two loops to make it explicit.
I can go back to that if it is preferred.

> > +	if (memmap_on_memory) {
> > +		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> > +						      get_nr_vmemmap_pages_cb);
> > +		if (nr_vmemmap_pages) {
> > +			if (size != memory_block_size_bytes()) {
> > +				pr_warn("Refuse to remove %#llx - %#llx, "
> > +					"wrong granularity\n",
> > +					start, start + size);
> > +				return -EINVAL;
> > +			}
> > +
> > +			/*
> > +			 * Let remove_pmd_table->free_hugepage_table do the
> > +			 * right thing if we used vmem_altmap when hot-adding
> > +			 * the range.
> > +			 */
> > +			mhp_altmap.alloc = nr_vmemmap_pages;
> > +			altmap = &mhp_altmap;
> > +		}
> > +	}
> > +
> >  	/* remove memmap entry */
> >  	firmware_map_remove(start, start + size, "System RAM");
> 
> I have to say I still dislike this and I would just wrap it inside out
> and do the operation from within walk_memory_blocks but I will not
> insist.

I have to confess I forgot the details of that discussion, as we were
quite focused on decoupling vmemmap pages from the {online,offline} interface.
Would you mind elaborating a bit more?


-- 
Oscar Salvador
SUSE L3
