Re: [PATCH 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug

From: Leonardo Bras <leobras.c@gmail.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Nathan Lynch <nathanl@linux.ibm.com>,
	David Hildenbrand <david@redhat.com>,
	Scott Cheloha <cheloha@linux.ibm.com>,
	linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	Nicholas Piggin <npiggin@gmail.com>,
	Bharata B Rao <bharata@linux.ibm.com>,
	Paul Mackerras <paulus@samba.org>,
	Sandipan Das <sandipan@linux.ibm.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Mike Rapoport <rppt@kernel.org>
Subject: Re: [PATCH 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug
Date: Thu, 08 Apr 2021 23:51:36 -0300	[thread overview]
Message-ID: <e5c924479839815c025de29d650d82419b18c0dc.camel@gmail.com> (raw)
In-Reply-To: <YFhNd42RvobCV8tF@yekko.fritz.box>

Hello David, thanks for the feedback!

On Mon, 2021-03-22 at 18:55 +1100, David Gibson wrote:
> > +void hash_memory_batch_expand_prepare(unsigned long newsize)
> > +{
> > +	/*
> > +	 * Resizing-up HPT should never fail, but there are some cases system starts with higher
> > +	 * SHIFT than required, and we go through the funny case of resizing HPT down while
> > +	 * adding memory
> > +	 */
> > +
> > +	while (resize_hpt_for_hotplug(newsize, false) == -ENOSPC) {
> > +		newsize *= 2;
> > +		pr_warn("Hash collision while resizing HPT\n");
> 
> This unbounded increase in newsize makes me nervous - we should be
> bounded by the current size of the HPT at least.  In practice we
> should be fine, since the resize should always succeed by the time we
> reach our current HPT size, but that's far from obvious from this
> point in the code.

Sure, I will add bounds in v2.

> 
> And... you're doubling newsize which is a value which might not be a
> power of 2.  I'm wondering if there's an edge case where this could
> actually cause us to skip the current size and erroneously resize to
> one bigger than we have currently.

I also though that at the start, but it seems quite reliable.
Before using this value, htab_shift_for_mem_size() will always round it
to next power of 2. 
Ex.
Any value between 0b0101 and 0b1000 will be rounded to 0b1000 for shift
calculation. If we multiply it by 2 (same as << 1), we have that
anything between 0b01010 and 0b10000 will be rounded to 0b10000. 

This works just fine as long as we are multiplying. 
Division may have the behavior you expect, as 0b0101 >> 1 would become
0b010 and skip a shift.

> > +void memory_batch_expand_prepare(unsigned long newsize)
> 
> This wrapper doesn't seem useful.

Yeah, it does little, but I can't just jump into hash_* functions
directly from hotplug-memory.c, without even knowing if it's using hash
pagetables. (in case the suggestion would be test for disable_radix
inside hash_memory_batch*)

> 
> > +{
> > +	if (!radix_enabled())
> > +		hash_memory_batch_expand_prepare(newsize);
> > +}
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  
> > 
> > +	memory_batch_expand_prepare(memblock_phys_mem_size() +
> > +				     drmem_info->n_lmbs * drmem_lmb_size());
> 
> This doesn't look right.  memory_add_by_index() is adding a *single*
> LMB, I think using drmem_info->n_lmbs here means you're counting this
> as adding again as much memory as you already have hotplugged.

Yeah, my mistake. This makes sense.
I will change it to something like 
memblock_phys_mem_size() + drmem_lmb_size()

> > 
> > +	memory_batch_expand_prepare(memblock_phys_mem_size() + lmbs_to_add * drmem_lmb_size());
> > +
> >  	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
> >  		if (lmb->flags & DRCONF_MEM_ASSIGNED)
> >  			continue;
> 
> I don't see memory_batch_expand_prepare() suppressing any existing HPT
> resizes.  Won't this just resize to the right size for the full add,
> then resize several times again as we perform the add?  Or.. I guess
> that will be suppressed by patch 1/3. 

Correct.

>  That's seems kinda fragile, though.

What do you mean by fragile here?
What would you suggest doing different?

Best regards,
Leonardo Bras