Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation

From: Roman Gushchin <guro@fb.com>
To: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@linux.com>,
	Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation
Date: Mon, 29 Mar 2021 13:10:10 -0700	[thread overview]
Message-ID: <YGI0IsSjGpqomJxh@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <YGIotHUFTrPauwrP@google.com>

On Mon, Mar 29, 2021 at 07:21:24PM +0000, Dennis Zhou wrote:
> On Wed, Mar 24, 2021 at 12:06:25PM -0700, Roman Gushchin wrote:
> > To return unused memory to the system schedule an async
> > depopulation of percpu chunks.
> > 
> > To balance between scanning too much and creating an overhead because
> > of the pcpu_lock contention and scanning not enough, let's track an
> > amount of chunks to scan and mark chunks which are potentially a good
> > target for the depopulation with a new boolean flag.  The async
> > depopulation work will clear the flag after trying to depopulate a
> > chunk (successfully or not).
> > 
> > This commit suggest the following logic: if a chunk
> >   1) has more than 1/4 of total pages free and populated
> >   2) isn't a reserved chunk
> >   3) isn't entirely free
> >   4) isn't alone in the corresponding slot
> 
> I'm not sure I like the check for alone that much. The reason being what
> about some odd case where each slot has a single chunk, but every slot
> is populated. It doesn't really make sense to keep them all around.

Yeah, I agree, I'm not sure either. Maybe we can just look at the total
number of populated empty pages and make sure it's not too low and not
too high. Btw, we should probably double PCPU_EMPTY_POP_PAGES_LOW/HIGH
if memcg accounting is on.

> 
> I think there is some decision making we can do here to handle packing
> post depopulation allocations into a handful of chunks. Depopulated
> chunks could be sidelined with say a flag ->depopulated to prevent the
> first attempt of allocations from using them. And then we could bring
> back a chunk 1 by 1 somehow to attempt to suffice the allocation.
> I'm not too sure if this is a good idea, just a thought.

I thought about it in this way: depopulated chunks are not different to
new chunks, which are not yet fully populated. And they are naturally
de-prioritized by being located in higher slots (and at the tail of the list).
So I'm not sure we should handle them any special.

> 
> > it's a good target for depopulation.
> > 
> > If there are 2 or more of such chunks, an async depopulation
> > is scheduled.
> > 
> > Because chunk population and depopulation are opposite processes
> > which make a little sense together, split out the shrinking part of
> > pcpu_balance_populated() into pcpu_grow_populated() and make
> > pcpu_balance_populated() calling into pcpu_grow_populated() or
> > pcpu_shrink_populated() conditionally.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/percpu-internal.h |   1 +
> >  mm/percpu.c          | 111 ++++++++++++++++++++++++++++++++-----------
> >  2 files changed, 85 insertions(+), 27 deletions(-)
> > 
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > index 18b768ac7dca..1c5b92af02eb 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -67,6 +67,7 @@ struct pcpu_chunk {
> >  
> >  	void			*data;		/* chunk data */
> >  	bool			immutable;	/* no [de]population allowed */
> > +	bool			depopulate;	/* depopulation hint */
> >  	int			start_offset;	/* the overlap with the previous
> >  						   region to have a page aligned
> >  						   base_addr */
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 015d076893f5..148137f0fc0b 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -178,6 +178,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
> >   */
> >  int pcpu_nr_empty_pop_pages;
> >  
> > +/*
> > + * Track the number of chunks with a lot of free memory.
> > + * It's used to release unused pages to the system.
> > + */
> > +static int pcpu_nr_chunks_to_depopulate;
> > +
> >  /*
> >   * The number of populated pages in use by the allocator, protected by
> >   * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
> > @@ -1955,6 +1961,11 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
> >  		if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
> >  			continue;
> >  
> > +		if (chunk->depopulate) {
> > +			chunk->depopulate = false;
> > +			pcpu_nr_chunks_to_depopulate--;
> > +		}
> > +
> >  		list_move(&chunk->list, &to_free);
> >  	}
> >  
> > @@ -1976,7 +1987,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
> >  }
> >  
> >  /**
> > - * pcpu_balance_populated - manage the amount of populated pages
> > + * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
> >   * @type: chunk type
> >   *
> >   * Maintain a certain amount of populated pages to satisfy atomic allocations.
> > @@ -1985,35 +1996,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
> >   * allocation causes the failure as it is possible that requests can be
> >   * serviced from already backed regions.
> >   */
> > -static void pcpu_balance_populated(enum pcpu_chunk_type type)
> > +static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
> >  {
> >  	/* gfp flags passed to underlying allocators */
> >  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> >  	struct list_head *pcpu_slot = pcpu_chunk_list(type);
> >  	struct pcpu_chunk *chunk;
> > -	int slot, nr_to_pop, ret;
> > +	int slot, ret;
> >  
> > -	/*
> > -	 * Ensure there are certain number of free populated pages for
> > -	 * atomic allocs.  Fill up from the most packed so that atomic
> > -	 * allocs don't increase fragmentation.  If atomic allocation
> > -	 * failed previously, always populate the maximum amount.  This
> > -	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
> > -	 * failing indefinitely; however, large atomic allocs are not
> > -	 * something we support properly and can be highly unreliable and
> > -	 * inefficient.
> > -	 */
> >  retry_pop:
> > -	if (pcpu_atomic_alloc_failed) {
> > -		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
> > -		/* best effort anyway, don't worry about synchronization */
> > -		pcpu_atomic_alloc_failed = false;
> > -	} else {
> > -		nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
> > -				  pcpu_nr_empty_pop_pages,
> > -				  0, PCPU_EMPTY_POP_PAGES_HIGH);
> > -	}
> > -
> >  	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
> >  		unsigned int nr_unpop = 0, rs, re;
> >  
> > @@ -2084,9 +2075,18 @@ static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> 
> I missed this in the review of patch 1, but pcpu_shrink only needs to
> iterate over:
> for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {

You mean skip first few slots?
Yeah, it's probably safe. I was afraid that a marked chunk can be moved to
one of such slots, so we'll never find it and will repeat scanning, but it seems
like it's not a possible scenario. Will adjust, thanks.

> 
> >  		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> >  			bool isolated = false;
> >  
> > -			if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> > +			if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH ||
> > +			    pcpu_nr_chunks_to_depopulate < 1)
> >  				break;
> >  
> > +			/*
> > +			 * Don't try to depopulate a chunk again and again.
> > +			 */
> > +			if (!chunk->depopulate)
> > +				continue;
> > +			chunk->depopulate = false;
> > +			pcpu_nr_chunks_to_depopulate--;
> > +
> >  			for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> >  				if (!chunk->nr_empty_pop_pages)
> >  					break;
> > @@ -2153,6 +2153,41 @@ static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> >  	spin_unlock_irq(&pcpu_lock);
> >  }
> >  
> > +/**
> > + * pcpu_balance_populated - manage the amount of populated pages
> > + * @type: chunk type
> > + *
> > + * Populate or depopulate chunks to maintain a certain amount
> > + * of free pages to satisfy atomic allocations, but not waste
> > + * large amounts of memory.
> > + */
> > +static void pcpu_balance_populated(enum pcpu_chunk_type type)
> > +{
> > +	int nr_to_pop;
> > +
> > +	/*
> > +	 * Ensure there are certain number of free populated pages for
> > +	 * atomic allocs.  Fill up from the most packed so that atomic
> > +	 * allocs don't increase fragmentation.  If atomic allocation
> > +	 * failed previously, always populate the maximum amount.  This
> > +	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
> > +	 * failing indefinitely; however, large atomic allocs are not
> > +	 * something we support properly and can be highly unreliable and
> > +	 * inefficient.
> > +	 */
> > +	if (pcpu_atomic_alloc_failed) {
> > +		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
> > +		/* best effort anyway, don't worry about synchronization */
> > +		pcpu_atomic_alloc_failed = false;
> > +		pcpu_grow_populated(type, nr_to_pop);
> > +	} else if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) {
> > +		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages;
> > +		pcpu_grow_populated(type, nr_to_pop);
> > +	} else if (pcpu_nr_chunks_to_depopulate > 0) {
> > +		pcpu_shrink_populated(type);
> > +	}
> > +}
> > +
> >  /**
> >   * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> >   * @work: unused
> > @@ -2188,6 +2223,7 @@ void free_percpu(void __percpu *ptr)
> >  	int size, off;
> >  	bool need_balance = false;
> >  	struct list_head *pcpu_slot;
> > +	struct pcpu_chunk *pos;
> >  
> >  	if (!ptr)
> >  		return;
> > @@ -2207,15 +2243,36 @@ void free_percpu(void __percpu *ptr)
> >  
> >  	pcpu_memcg_free_hook(chunk, off, size);
> >  
> > -	/* if there are more than one fully free chunks, wake up grim reaper */
> >  	if (chunk->free_bytes == pcpu_unit_size) {
> > -		struct pcpu_chunk *pos;
> > -
> > +		/*
> > +		 * If there are more than one fully free chunks,
> > +		 * wake up grim reaper.
> > +		 */
> >  		list_for_each_entry(pos, &pcpu_slot[pcpu_nr_slots - 1], list)
> >  			if (pos != chunk) {
> >  				need_balance = true;
> >  				break;
> >  			}
> > +
> > +	} else if (chunk->nr_empty_pop_pages > chunk->nr_pages / 4) {
> 
> We should have this ignore the first and reserved chunks. While it
> shouldn't be possible in theory, it would be nice to just make it
> explicit here.

Ok, will do, makes sense to me!

> 
> > +		/*
> > +		 * If there is more than one chunk in the slot and
> > +		 * at least 1/4 of its pages are empty, mark the chunk
> > +		 * as a target for the depopulation. If there is more
> > +		 * than one chunk like this, schedule an async balancing.
> > +		 */
> > +		int nslot = pcpu_chunk_slot(chunk);
> > +
> > +		list_for_each_entry(pos, &pcpu_slot[nslot], list)
> > +			if (pos != chunk && !chunk->depopulate &&
> > +			    !chunk->immutable) {
> > +				chunk->depopulate = true;
> > +				pcpu_nr_chunks_to_depopulate++;
> > +				break;
> > +			}
> > +
> > +		if (pcpu_nr_chunks_to_depopulate > 1)
> > +			need_balance = true;
> >  	}
> >  
> >  	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
> > -- 
> > 2.30.2
> > 
> 
> Some questions I have:
> 1. How do we prevent unnecessary scanning for atomic allocations?

Depopulated chunks tend to be at tail of the chunks lists in high(er) slots,
so they seem to be the last target for an atomic allocation, if there is
enough space in other chunks.

> 2. Even in the normal case, should we try to pack future allocations
> into a smaller # of chunks in after depopulation?

Well, there is one specific problem I'm trying to solve: if the percpu memory
is heavily inflated by something (e.g. by creating a ton of cgroups or bpf maps),
sometimes it's impossible to get the memory back at all, even if the absolute
majority of percpu objects was released. In this case there are many chunks
which are almost entirely empty, and the actual size of the needed percpu memory
is way less that the number of populated pages.

We can look at the percpu memory fragmentation as a more fundamental problem,
and probably the right thing to do long-term is to introduce some sort of a
slab allocator. At least, we can put small allocations (aka percpu reference
counters) into separate chunks. But this is obviously a way bigger change,
which unlikely can do into any stable branches, so I'd treat it separately.
Also, I'm not convinced that we really need it so much at the moment.
If the percpu usage is more or less stable, I don't see any pathological
fragmentation problem.

> 3. What is the right frequency to do depopulation scanning? I think of
> the pcpu work item as a way to defer the 2 the freeing of chunks and in
> a way more immediately replenish free pages. Depopulation isn't
> necessarily as high a priority.

I think that the number of chunks which are potentially good for the
depopulation is a good metric. I've chosen 2 as a threshold, but I'm
fine with other ideas as well.

Thank you!